Trigger Solr update on new file events #33

Closed jthomsen closed 9 years ago

jthomsen commented 10 years ago

Update the corresponding Solr indexes whenever a new SRT, EXB, or TTL file has been created.

jthomsen commented 10 years ago

I have been investigating how to update the LinkedTV Solr index. Luckily, I found that Mathilde provided a readme; I've put its content on the wiki under http://www.linkedtv.eu/wiki/index.php/Solr and edited it further. To summarize:

jthomsen commented 10 years ago

Status: the main rbbaktuell shows for which both SRT and EXB files exist (around 30) have been indexed successfully (with the above-mentioned restrictions). Indexing S&V content gives an NPE. Call:

java -jar TV2Lucene.jar -srt /mnt/data/SV/final/subs/TUSSENKUNST-AVR00006KUO_115000_3091120.srt -exm /mnt/data/SV/final/exmaralda/TUSSENKUNST-AVR00006KUO_115000_3091120.exb -provider SV -videoId 040119d2-76d6-4b8e-a5ce-b08fd380dc87

Log output:

no tvanytime metadata were given
provider: SV
videoId: 040119d2-76d6-4b8e-a5ce-b08fd380dc87
http://localhost:8983/solr/SVindex
------- connection to solr server ok -----
exmaralda file processed
subtitles processed
------- input file processed -------------
there are 323 concepts
java.lang.NullPointerException
        at fr.eurecom.TV2Lucene.Main.initializeConcepts(Main.java:304)
        at fr.eurecom.TV2Lucene.Main.main(Main.java:140)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)

rtroncy commented 10 years ago

Tagging @benoithuet for having a look

rtroncy commented 10 years ago

This is also related to Issue 16

jthomsen commented 10 years ago

There is a problem with the rbb SRT files: we don't get them together with the MP4 and TVA files. SRT files are provided only in batches, sometimes weekly or even monthly, and sometimes not at all. Since both the -srt and -exm parameters are mandatory for TV2Lucene, we cannot include the indexing in the daily ingestion process. So it would be best to update TV2Lucene to make both parameters optional, so that indexing could be triggered on single file events.

rtroncy commented 10 years ago

If there is no SRT, how can NERD work? Does it mean you do TV2RDF without NERD? It seems to me that this should NOT be the regular process. If we parameterize TV2Lucene the way you indicate, does this mean you will constantly re-update the index whenever new resources become available?

jthomsen commented 10 years ago

I totally agree that this should not be the regular process but as a matter of fact this is how it works; the STL files are generated at a different department at rbb in a semi-automatic way, i.e. they are not part of any automatic process but somebody has to think of it and take care that it happens; and as we've discussed with rbb there is currently no chance of changing this (that's at least my last state on this matter). (I've documented the rbb ingestion process now also on http://www.linkedtv.eu/wiki/index.php/LinkedTV_Platform#rbb)

So, as a consequence, I currently see only these options (at least for tv2rdf):
1) either we allow processing of single files on arrival,
2) or else we treat the existence of the SRTs as mandatory, because without NERD it doesn't make any sense, and then we trigger tv2rdf only when an SRT is there (but without knowing in advance when, or whether, this will happen).

As for the Platform, we currently call tv2rdf only for selected videos where an SRT exists anyway, since we are first rerunning the analysis service, which has been updated to include chapter segmentation, for the videos of the last month (this takes approx. 10 hrs per video).

Concerning tv2lucene, we could do either: update only when an SRT comes in, or update whenever an EXB or SRT is available.

rtroncy commented 10 years ago

For TV2RDF, I would adopt your solution 2) and wait until all resources are available before calling the service. I would argue that we could do the same for TV2Lucene, but indeed, just as TV2RDF is modular, TV2Lucene could also be modular. @benoithuet Any chance we can get a modified version of this module with this feature? By when?

rtroncy commented 10 years ago

Among the changes to consider:

jthomsen commented 10 years ago

Additionally, I would strongly suggest that an update can happen on single file events, i.e. that the file params are optional. Otherwise we can only update on a very irregular basis, when all three files (SRT, TVA, EXM) are available, which can be as rarely as once a month and not even for all videos.
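
A minimal sketch of what such a per-file trigger could look like, assuming a TV2Lucene build in which -srt and -exm are optional (the version discussed in this thread requires both). The function name `index_available_files` and the `TV2LUCENE_CMD` override are hypothetical; the flags are the ones used in the calls in this thread.

```shell
# Command used to launch the indexer; set TV2LUCENE_CMD=echo for a dry run.
TV2LUCENE_CMD="${TV2LUCENE_CMD:-java -jar TV2Lucene.jar}"

# index_available_files PROVIDER VIDEO_ID SRT_PATH EXB_PATH
# Passes -srt / -exm only for files that exist on disk.
# ASSUMPTION: the indexer accepts a call with just one of the two flags.
index_available_files() {
  idx_provider="$1"; idx_video="$2"; idx_srt="$3"; idx_exb="$4"
  set -- -provider "$idx_provider" -videoId "$idx_video"
  idx_found=0
  if [ -n "$idx_srt" ] && [ -f "$idx_srt" ]; then
    set -- "$@" -srt "$idx_srt"; idx_found=1
  fi
  if [ -n "$idx_exb" ] && [ -f "$idx_exb" ]; then
    set -- "$@" -exm "$idx_exb"; idx_found=1
  fi
  # Nothing new to index: report it and do not call the indexer.
  if [ "$idx_found" -eq 0 ]; then
    echo "no SRT or EXB file found for $idx_video" >&2
    return 1
  fi
  # Left unquoted on purpose so the command string splits into words.
  $TV2LUCENE_CMD "$@"
}

# Usage (not executed here):
#   index_available_files SV "$video_id" "$srt_path" "$exb_path"
```

With `TV2LUCENE_CMD=echo` the function only prints the argument list it would pass, which makes it possible to check the trigger logic without touching the index.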

jthomsen commented 10 years ago

we are now adding 2 updates to the Solr rbb index every day, whenever a new Exmaralda file is available (we're running 2 Exmaralda WP1 requests a day, each takes about 10 hrs to process)

rtroncy commented 10 years ago

Per today's WP2 telecon, a new TV2Lucene module has been developed with the following features:

jthomsen commented 9 years ago

trying to index TKK files gives an NPE error, probably in connection with the concepts.txt file: Call:

java -jar TV2Lucene.jar  -provider SV -exm /mnt/data/SV/final/exmaralda/TUSSEN_KUNST_-AVR00006KUO_115000_3091120.exb -srt /mnt/data/SV/final/subs/TUSSEN_KUNST_-AVR00006KUO_115000_3091120.srt -videoId 040119d2-76d6-4b8e-a5ce-b08fd380dc87

Log:

TRYING TO PARSE
PARSED
srt: /mnt/data/SV/final/subs/TUSSEN_KUNST_-AVR00006KUO_115000_3091120.srt
exm: /mnt/data/SV/final/exmaralda/TUSSEN_KUNST_-AVR00006KUO_115000_3091120.exb
no tvanytime metadata were given
provider: SV
videoId: 040119d2-76d6-4b8e-a5ce-b08fd380dc87
http://localhost:8983/solr/SVindex
------- connection to solr server ok -----
subtitles processed
------- input file processed -------------
/home/linkedtv/TV2Lucene/script
there are 323 concepts
java.lang.NullPointerException
        at fr.eurecom.TV2Lucene.Main.initializeConcepts(Main.java:331)
        at fr.eurecom.TV2Lucene.Main.main(Main.java:144)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:622)
        at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)

jluisred commented 9 years ago

I tried to reproduce the error on my machine, using the metadata files available in TV2RDF for that particular Media Resource (040119d2-76d6-4b8e-a5ce-b08fd380dc87). I hope they are the newest ones.

The problem is that the corresponding Exmaralda file does not contain any visual concepts. TV2Lucene indexes both transcripts and visual cues, and by default it expects both to be present.

Question: Is this particular to this video, or are there many others where LSCOM concepts are missing as well?

jthomsen commented 9 years ago

thanks for investigating - actually I don't know, could check; however, Lampis is currently working on a new algorithm for TKK videos anyway, so I suggest we discuss this in our WP5 telecon later today

jthomsen commented 9 years ago

now I tried

java -jar TV2Lucene.jar  -provider SV -exm /mnt/data/SV/final/exmaralda/TUSSEN_KUNST_-AVR00006Z6J_115000_2850800.exb -srt /mnt/data/SV/final/subs/TUSSEN_KUNST_-AVR00006Z6J_115000_2850800.srt -videoId 48ee6e56-4747-465b-9ef4-e219720aa113

where ftp://ftp.condat.de/archive/data/SV/final/exmaralda/TUSSENKUNST-AVR00006Z6J_115000_2850800.exb for me seems to include visual concepts, but I get the same Null Pointer Exception. Am I overlooking something?

jluisred commented 9 years ago

This new Exmaralda file looks exactly the same... no visual concepts. Are you sure we are talking about the same document? In more detail:

I got the file from here (not exactly the same URL you specified; I needed to capitalize the d in "data", but I guess this is not relevant): ftp://ftp.condat.de/archive/Data/SV/final/exmaralda/TUSSENKUNST-AVR00006Z6J_115000_2850800.exb

If you take a look at it, there is no "CERTH_Concept-1_all" layer, which is the one TV2Lucene relies on.
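
A quick way to check a file for that layer before indexing could look like the sketch below. The function name `has_visual_concepts` is hypothetical, and matching the layer name as a plain string anywhere in the file is an assumption about how the .exb file stores it.

```shell
# has_visual_concepts EXB_PATH
# Succeeds if the Exmaralda file mentions the visual-concept layer.
# ASSUMPTION: the layer name appears literally somewhere in the file.
has_visual_concepts() {
  grep -qF -- 'CERTH_Concept-1_all' "$1"
}

# Usage (not executed here):
#   if has_visual_concepts "$exb_path"; then
#     echo "ready to index: $exb_path"
#   else
#     echo "skipping, no visual concepts yet: $exb_path" >&2
#   fi
```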

jthomsen commented 9 years ago

ah, ok thanks, I see the problem, sorry!! will check again.

jthomsen commented 9 years ago

OK, I've checked again with the file 'TUSSENKUNST-AVR00008KC2_115000_2839440.exb', which contains both chapters and visual concepts, and indeed it works! Thanks again. So I will index all TKK files which contain both visual concepts and chapters (currently about 12, as far as I can see); the others will follow as soon as they are available.

jthomsen commented 9 years ago

ok, obviously we have to delete the existing indexes and then reindex. I tried different ways like

curl http://localhost:8983/solr/#/RBBIndex/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'

and in the browser, however this doesn't work. Could you please tell me the right way to delete an index here? Thanks!

jthomsen commented 9 years ago

did investigate a little bit more and came up with the following (the above statement cannot work, because we're using multiple cores here):

http://localhost:8983/solr/admin/cores?action=RELOAD&core=RBBIndex&deleteIndex=true

but I am not sure, and I don't want to try this and mess up the index; could you confirm this or correct it, pls? Thanks!!

jthomsen commented 9 years ago

OK, found also another possibility: stop the Solr Server, delete the files in the data directory and redeploy the SolrServer; will try this

jluisred commented 9 years ago

There is a REST API call you can try against the server that should do the job; take a look at: http://stackoverflow.com/questions/7722508/how-to-delete-all-data-from-solr-and-hbase
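
As a rough sketch of such a call with curl: the request goes to the core's own update handler, with no "#" in the path (the "#" only belongs to the admin UI's address in the browser, and everything after it is not sent to the server). The host, port, and the core name RBBIndex are taken from this thread; the helper names `solr_update_url` and `solr_delete_all` and the `SOLR_BASE` variable are hypothetical. Note that this removes every document in the core.

```shell
# Base URL of the Solr server; override SOLR_BASE for another host.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# solr_update_url CORE
# Prints the update-handler URL for one core, with an immediate commit.
solr_update_url() {
  printf '%s/%s/update?commit=true\n' "$SOLR_BASE" "$1"
}

# solr_delete_all CORE
# DESTRUCTIVE: deletes all documents in the given core and commits.
solr_delete_all() {
  curl -sS "$(solr_update_url "$1")" \
    -H 'Content-Type: text/xml; charset=utf-8' \
    --data-binary '<delete><query>*:*</query></delete>'
}

# Usage (not executed here):
#   solr_delete_all RBBIndex
```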

jthomsen commented 9 years ago

ok, thanks! but think I will now just delete the /data folders, redeploy/restart and reindex