hbz / lobid

Linking Open Bibliographic Data

https://lobid.org/

Eclipse Public License 2.0


Create testset for full data processing workflow #30

Closed acka47 closed 10 years ago

acka47 commented 10 years ago

~100k resources + items (including resources from the API doc page and test files) + examples from GitHub issues

acka47 commented 10 years ago

The whole process over the testset should run in < 1h.

dr0i commented 10 years ago

I took 5k docs; the transformation results in 53k different subject URIs (most of them items).

Time consuming:

- Transformation into N-Triples: 2 min
- Enrichment and conversion to JSON-LD:
  - hbz01-resources:
    - collectSubjects: 30 min :-1:
    - NtriplesToJson: 420 min :-1:
  - hbz01-items:
    - collectSubjects: 12 sec
    - NtriplesToJson: 90 sec

Clearly the bottleneck is the enrichment, which only takes place for hbz01-resources. It may be worth a look, see lobid/lodmill#331. As a workaround I propose to do without enrichment for this test data workflow. What do you think, @acka47 @fsteeg?
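To put the numbers above next to the < 1h target, here is a minimal Python sketch that sums the reported timings and shows each step's share. The dictionary keys, `BUDGET_S`, and `summarize` are names made up for this illustration; only the durations come from the measurements in this comment.

```python
# Timings reported in this comment, converted to seconds.
# The step labels are illustrative; only the durations are from the thread.
timings_s = {
    "transformation into ntriples": 2 * 60,
    "hbz01-resources collectSubjects": 30 * 60,
    "hbz01-resources NtriplesToJson": 420 * 60,
    "hbz01-items collectSubjects": 12,
    "hbz01-items NtriplesToJson": 90,
}

BUDGET_S = 60 * 60  # target from this issue: whole process in < 1 h


def summarize(timings, budget_s):
    """Return total runtime, per-step share of the total, and a budget check."""
    total = sum(timings.values())
    shares = {step: secs / total for step, secs in timings.items()}
    return total, shares, total < budget_s


total_s, shares, within_budget = summarize(timings_s, BUDGET_S)

if __name__ == "__main__":
    # Print the steps from most to least expensive.
    for step, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{step:35s} {timings_s[step]:>6d} s  {share:6.1%}")
    print(f"total: {total_s} s ({total_s / 60:.1f} min), "
          f"within 1 h budget: {within_budget}")
```

With these figures the total is 27222 s (about 453.7 min), far above the 1 h target, and the two hbz01-resources steps together account for roughly 99% of it.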

fsteeg commented 10 years ago

+1 for no enrichment in the testset for now. When we need it, we can add an enrichment testset.

acka47 commented 10 years ago

+1 for starting without enrichment. We can't test the whole UI functionality like this, though.

dr0i commented 10 years ago

Executing one script (https://github.com/lobid/lodmill/blob/master/lodmill-ld/doc/scripts/processTestHbz01.sh) is enough to start both transformation and indexing. 5k hbz01 resource docs take 5 min.

Closing.

dr0i commented 10 years ago

Added a GND test set. It takes just a few seconds more to build everything for the test index.

dr0i commented 9 years ago

Note: to avoid breaking a running transformation, and so that the tests are executed immediately, they run on a separate server that is not connected to the production Hadoop cluster. The test Hadoop cluster and everything else needed for testing reside on gaia.