mitodl / lore

Learning Object Repository for Education
https://goo.gl/nDVx4D
GNU Affero General Public License v3.0

course import performance issue #448

Open pwilkins opened 9 years ago

pwilkins commented 9 years ago

This course is loaded into staging.mitx.mit.edu from https://github.mit.edu/mitx/content-mit-801r-2012_Fall. When I clone the repo, create a tar.gz of the repo contents, and import it into LORE, it imports as only a single item. On staging it renders properly: https://staging.mitx.mit.edu/courses/MITx/8.01r-2012_Fall/2012_Fall/info

pwilkins commented 9 years ago

I was able to use sysadmin/courses on sandbox2o to import this course from its git repo and then export it from studio-sandbox2o. The exported file successfully loaded into LORE so Studio fixed whatever was broken in the git tarball.

tssheth commented 9 years ago

I successfully imported the zip directly from the GitHub repo into lms-staging. Then I tried to import the course into LORE and it showed up as a single course item, as @pwilkins mentioned. I also tried importing the zip directly from the content repo into LORE, with the same result.

This was tested with https://github.mit.edu/edX-Archives/edX-Archive-8.MReVx-2T2014 and https://github.mit.edu/edX-Archives/edX-Archive-8.MechCx-1T2015

http://lore-ci.herokuapp.com/repositories/tej-test/ http://lore-ci.herokuapp.com/repositories/tej-test-02/

carsongee commented 9 years ago

Just to add some data to this, I ran the 8.01r import locally. It used 100% of the celery worker CPU and took about three minutes:

celery_1 | [2015-07-30 21:46:32,884: INFO/MainProcess] Task importer.tasks.import_file[cb1c3108-e1c6-4874-95e8-20be72ff3795] succeeded in 185.303660554s: None

After it completed, all 352 items showed up.

It only showed up as the course object until the import completed, so I'm guessing your import got interrupted by a push. That seems especially likely because lore-ci has so far been about 10-20 times slower than local Docker. I have also started a MechCx import, as I suspect it will behave similarly, but it appears to be taking quite a bit more time.

carsongee commented 9 years ago

8.MechCx imported after 3507.65911322s in the Docker container.

carsongee commented 9 years ago

In the current release (same system) the import of 8.01 took 25.558808708s (7x faster), and MechCx took 174.744378279s (a 20x speedup). I think we will probably want to figure this out before releasing.

giocalitri commented 9 years ago

My initial theory was that the problem was caused by our use of RealtimeSignalProcessor in Haystack to trigger updates to Elasticsearch.

This setting works well during normal use of the website, but it's not ideal during import, where it would probably be better to index in batches. I found this, which describes one idea for how to approach the problem.
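As a rough illustration of the idea, here is a minimal sketch assuming a standard django-haystack setup; the comments describe one possible approach, not LORE's actual configuration:

# settings.py (sketch of a plain django-haystack configuration)

# Realtime indexing: every model save/delete immediately triggers an
# Elasticsearch update. Fine for interactive use, costly during bulk imports.
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'

# One possible alternative for imports: use the no-op BaseSignalProcessor so
# saves don't touch Elasticsearch at all, then rebuild the index in a single
# batch pass afterwards with Haystack's management command:
#
#     python manage.py update_index --batch-size=1000
#
# HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.BaseSignalProcessor'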

Then I ran some tests and imported a course (8.MechCx-1T2015) locally in my Docker environment.

At the beginning, every single request to Elasticsearch was taking roughly 0.5s (yes, half a second); then, for no apparent reason (I was at lunch at the time), it started taking a few milliseconds per request.

In more detail, this is when Celery started the import:

celery_1  | [2015-07-31 16:07:57,227: INFO/Worker-6] Starting new HTTP connection (1): elastic
celery_1  | [2015-07-31 16:07:57,237: INFO/Worker-6] GET http://elastic:9200/haystack/_mapping [status:200 request:0.010s]
celery_1  | [2015-07-31 16:07:57,237: WARNING/Worker-6] No handlers could be found for logger "elasticsearch.trace"
celery_1  | [2015-07-31 16:07:57,241: INFO/Worker-6] PUT http://elastic:9200/haystack [status:400 request:0.004s]
celery_1  | [2015-07-31 16:07:57,252: INFO/Worker-6] PUT http://elastic:9200/haystack/_mapping/modelresult [status:200 request:0.010s]
elastic_1 | [2015-07-31 16:07:58,237][INFO ][cluster.metadata         ] [Isis] [haystack] update_mapping [modelresult] (dynamic)
celery_1  | [2015-07-31 16:07:58,274: INFO/Worker-6] POST http://elastic:9200/haystack/modelresult/_bulk [status:200 request:0.875s]
celery_1  | [2015-07-31 16:07:58,573: INFO/Worker-6] POST http://elastic:9200/haystack/_refresh [status:200 request:0.297s]
celery_1  | [2015-07-31 16:08:01,032: INFO/Worker-6] POST http://elastic:9200/haystack/modelresult/_bulk [status:200 request:0.616s]
celery_1  | [2015-07-31 16:08:01,180: INFO/Worker-6] POST http://elastic:9200/haystack/_refresh [status:200 request:0.144s]
celery_1  | [2015-07-31 16:08:01,890: INFO/Worker-6] POST http://elastic:9200/haystack/modelresult/_bulk [status:200 request:0.610s]

This is the moment when the requests started to become faster:

celery_1  | [2015-07-31 16:45:44,761: INFO/Worker-6] POST http://elastic:9200/haystack/modelresult/_bulk [status:200 request:0.574s]
celery_1  | [2015-07-31 16:45:44,816: INFO/Worker-6] POST http://elastic:9200/haystack/_refresh [status:200 request:0.053s]
celery_1  | [2015-07-31 16:45:45,494: INFO/Worker-6] POST http://elastic:9200/haystack/modelresult/_bulk [status:200 request:0.585s]
celery_1  | [2015-07-31 16:45:45,572: INFO/Worker-6] POST http://elastic:9200/haystack/_refresh [status:200 request:0.076s]
celery_1  | [2015-07-31 16:45:45,591: INFO/Worker-6] POST http://elastic:9200/haystack/modelresult/_bulk [status:200 request:0.010s]
celery_1  | [2015-07-31 16:45:45,598: INFO/Worker-6] POST http://elastic:9200/haystack/_refresh [status:200 request:0.006s]
celery_1  | [2015-07-31 16:45:45,613: INFO/Worker-6] POST http://elastic:9200/haystack/modelresult/_bulk [status:200 request:0.004s]
celery_1  | [2015-07-31 16:45:45,620: INFO/Worker-6] POST http://elastic:9200/haystack/_refresh [status:200 request:0.007s]

And this is at the end:

celery_1  | [2015-07-31 16:50:30,576: INFO/Worker-6] POST http://elastic:9200/haystack/modelresult/_bulk [status:200 request:0.002s]
celery_1  | [2015-07-31 16:50:30,579: INFO/Worker-6] POST http://elastic:9200/haystack/_refresh [status:200 request:0.003s]
celery_1  | [2015-07-31 16:50:30,587: INFO/Worker-6] POST http://elastic:9200/haystack/modelresult/_bulk [status:200 request:0.001s]
celery_1  | [2015-07-31 16:50:30,595: INFO/Worker-6] POST http://elastic:9200/haystack/_refresh [status:200 request:0.007s]
celery_1  | [2015-07-31 16:50:30,776: INFO/MainProcess] Task importer.tasks.import_file[d664575d-b514-4e83-b947-89b8c8da84cc] succeeded in 2563.47286971s: None

I am not sure what could make the behavior change so drastically, but I am fairly sure the problem is related to this.

carsongee commented 9 years ago

I'm not quite sure how the import proceeds, but I'm guessing the slower POSTs are just from posting larger content structures, i.e. the course block is upwards of 4 MiB, but once you get down to ingesting individual problems they are usually < 1 KiB.

giocalitri commented 9 years ago

That was the first idea Brandon and I had on Friday, but the problem with it is that every subsequent attempt to import the same large course (or other large courses) is fast right from the start. Anyway, I'm not sure whether this is relevant, but I would look at the Celery or Elasticsearch logs on Heroku to see if a similar issue shows up there.

carsongee commented 9 years ago

So my original test was a bit off. It turns out that when I thought I was testing against the latest release, I was actually testing against v0.4.0 instead of v0.5.0. I'll work on verifying it, but I think this means there isn't any additional slowdown between v0.5 and v0.6, and that the most likely culprit is the additional parsing of assets on import. We may look at optimizing there after I verify, but we should be good for release.

carsongee commented 9 years ago

I confirmed this by running git bisect from v0.4.0 to v0.5.0:

ed599cd6d183ab3286011d16febd6d9d7e1be087 is the first bad commit
commit ed599cd6d183ab3286011d16febd6d9d7e1be087
Author: Shawn Milochik <shawn@milochik.com>
Date:   Wed Jul 15 04:13:47 2015 -0400

    Added static file parsing to HTML elements.

    Closes #231.

:040000 040000 6fc8f56d6d13a1ca54114ac8cc91eb765c760337 8437fb90c1cf40b96ee2867dae5835e9546c3a85 M      importer
:040000 040000 918a1ad71413ebe195643ab2b471a72e6d7c5fd3 75254c2b40ac13f190ad0230bb4642123974cba4 M      learningresources
:100644 100644 09a33ee25076478ad84ee810e29f76ef6335be10 346e19e398d50bf4bcd7cce791cd2b0e39e49037 M      requirements.txt
:040000 040000 76e0f05a68ba4149a78edfbed4c9f854053bcabf 2ac993f672365ef158816a22710d4171ebc9cac2 M      rest
:040000 040000 85b0a1f9bf88e54c9c4a165d5c93328379a72a36 71feac702805e94f65bb810d0284b68b541091e9 M      ui

which makes a lot of sense, since we added a ton of XML parsing in that commit.
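For context, the kind of per-element work that commit introduces is roughly the following. This is a hypothetical sketch only; the function name, regex, and example markup are assumptions for illustration, not LORE's actual importer code:

import re

# Matches references to files served from a /static/ path inside markup.
STATIC_REF = re.compile(r'/static/[\w./-]+')

def find_static_assets(html_text):
    """Collect static-file references from one HTML element's markup.

    Running something like this over every HTML element in a course means
    re-scanning a lot of markup during import, which is the kind of extra
    per-item cost suspected above.
    """
    return set(STATIC_REF.findall(html_text))

# Example: an HTML block referencing one image and one handout.
print(find_static_assets(
    '<p><img src="/static/images/pendulum.png"/> '
    '<a href="/static/handouts/week1.pdf">Handout</a></p>'
))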