Closed jthomsen closed 9 years ago
have been investigating the update of the LinkedTV Solr index; luckily I've found that Mathilde provided a readme; I've put the content on the Wiki under http://www.linkedtv.eu/wiki/index.php/Solr and edited it further. To summarize:
Status: the main rbbaktuell shows for which both SRT and EXB files exist (around 30) have been indexed successfully (with the above-mentioned restrictions). Indexing S&V content gives an NPE.
Call:
java -jar TV2Lucene.jar -srt /mnt/data/SV/final/subs/TUSSENKUNST-AVR00006KUO_115000_3091120.srt -exm /mnt/data/SV/final/exmaralda/TUSSENKUNST-AVR00006KUO_115000_3091120.exb -provider SV -videoId 040119d2-76d6-4b8e-a5ce-b08fd380dc87
Log:
no tvanytime metadata were given
provider: SV
videoId: 040119d2-76d6-4b8e-a5ce-b08fd380dc87
http://localhost:8983/solr/SVindex
------- connection to solr server ok -----
exmaralda file processed
subtitles processed
------- input file processed -------------
there are 323 concepts
java.lang.NullPointerException
at fr.eurecom.TV2Lucene.Main.initializeConcepts(Main.java:304)
at fr.eurecom.TV2Lucene.Main.main(Main.java:140)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
Tagging @benoithuet for having a look
There is a problem with respect to rbb SRT files: we don't get the SRT files together with the MP4 and TVA files; SRT files are provided only as batches, sometimes weekly or even monthly, and sometimes not at all. Since both parameters -srt and -exm are mandatory for TV2Lucene, we cannot include the indexing in the daily ingestion process. So it would be best to update TV2Lucene to make both parameters optional, so that indexing could be triggered on single file events.
If there is no SRT, how can NERD work? Does it mean you do TV2RDF without NERD? It seems to me that this should NOT be the regular process. If we parameterize TV2Lucene the way you indicate, does this mean you will constantly re-update the index whenever new resources become available?
I totally agree that this should not be the regular process but as a matter of fact this is how it works; the STL files are generated at a different department at rbb in a semi-automatic way, i.e. they are not part of any automatic process but somebody has to think of it and take care that it happens; and as we've discussed with rbb there is currently no chance of changing this (that's at least my last state on this matter). (I've documented the rbb ingestion process now also on http://www.linkedtv.eu/wiki/index.php/LinkedTV_Platform#rbb)
So, as a consequence, I currently see only these options (at least for tv2rdf): 1) either we allow processing of single files on arrival, 2) or else we treat the existence of the SRTs as mandatory, because without NERD it doesn't make any sense, and then we trigger tv2rdf only when an SRT is there (but without any prior knowledge of when and if this will happen).
As for the Platform, we currently still call tv2rdf only for selected videos where an SRT exists anyway, since we are first rerunning the analysis service for videos of the last month; the service has been updated to include chapter segmentation (which takes approx. 10 hrs per video).
Concerning tv2lucene we could do both; either update only when SRT is coming in or whenever an EXB or SRT is available
For TV2RDF, I would adopt your solution 2) and wait to have all resources before calling the service. I would also argue that we could do the same for TV2Lucene, but indeed, just as TV2RDF is modular, TV2Lucene could also be modular. @benoithuet Any chance we can get a modified version of this module with this feature? By when?
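A minimal sketch of the "wait for all resources" check from solution 2), assuming the ingestion side can list the file names that have arrived for a video. The function name, the base-name convention and the default extensions are hypothetical and only serve as illustration.

```python
from typing import Iterable, List, Tuple


def missing_resources(
    available_files: Iterable[str],
    base_name: str,
    required_exts: Tuple[str, ...] = (".srt", ".exb"),
) -> List[str]:
    """Return the required file names that have not arrived yet.

    An empty result means all resources are present and the service
    (tv2rdf or TV2Lucene) could be triggered for this video.
    """
    available = set(available_files)
    return [base_name + ext for ext in required_exts if base_name + ext not in available]


if __name__ == "__main__":
    arrived = ["video_a.exb"]  # hypothetical file names
    print(missing_resources(arrived, "video_a"))
```

The caller would simply skip the video while the returned list is non-empty and retry on the next file event.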
Among the changes to consider:
Additionally, I would strongly suggest that an update can happen on single file events, i.e. the file params are optional; otherwise we can only update on a very irregular basis when all three (SRT, TVA, EXM) are available, which can be as rarely as once a month, and even then not for all videos.
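To illustrate what triggering on single file events could look like on the calling side, here is a minimal sketch that builds the TV2Lucene call with only the files that are available. It assumes a TV2Lucene version that accepts a missing -srt or -exm, which is exactly the requested change and not something the current jar supports; the helper name is hypothetical, the flags are the ones used in the calls in this thread.

```python
from typing import List, Optional


def build_tv2lucene_command(
    provider: str,
    video_id: str,
    srt: Optional[str] = None,
    exm: Optional[str] = None,
    jar: str = "TV2Lucene.jar",
) -> List[str]:
    """Build the argument list, adding -srt / -exm only when a path is given.

    Assumption: the jar treats both file parameters as optional.
    """
    cmd = ["java", "-jar", jar, "-provider", provider]
    if exm is not None:
        cmd += ["-exm", exm]
    if srt is not None:
        cmd += ["-srt", srt]
    cmd += ["-videoId", video_id]
    return cmd


if __name__ == "__main__":
    # Only the Exmaralda file has arrived so far; the path is a placeholder.
    print(" ".join(build_tv2lucene_command("SV", "some-video-id", exm="/path/to/file.exb")))
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting problems with file names.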
we are now adding 2 updates to the Solr rbb index every day, whenever a new Exmaralda file is available (we're running 2 Exmaralda WP1 requests a day, each takes about 10 hrs to process)
Per today's WP2 telecon, a new TV2Lucene module has been developed with the following features:
trying to index TKK files gives an NPE error, probably in connection with the concepts.txt file: Call:
java -jar TV2Lucene.jar -provider SV -exm /mnt/data/SV/final/exmaralda/TUSSEN_KUNST_-AVR00006KUO_115000_3091120.exb -srt /mnt/data/SV/final/subs/TUSSEN_KUNST_-AVR00006KUO_115000_3091120.srt -videoId 040119d2-76d6-4b8e-a5ce-b08fd380dc87
Log:
TRYING TO PARSE
PARSED
srt: /mnt/data/SV/final/subs/TUSSEN_KUNST_-AVR00006KUO_115000_3091120.srt
exm: /mnt/data/SV/final/exmaralda/TUSSEN_KUNST_-AVR00006KUO_115000_3091120.exb
no tvanytime metadata were given
provider: SV
videoId: 040119d2-76d6-4b8e-a5ce-b08fd380dc87
http://localhost:8983/solr/SVindex
------- connection to solr server ok -----
subtitles processed
------- input file processed -------------
/home/linkedtv/TV2Lucene/script
there are 323 concepts
java.lang.NullPointerException
at fr.eurecom.TV2Lucene.Main.initializeConcepts(Main.java:331)
at fr.eurecom.TV2Lucene.Main.main(Main.java:144)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
I tried to reproduce the error on my machine, using the metadata files available in TV2RDF for that particular Media Resource (040119d2-76d6-4b8e-a5ce-b08fd380dc87). I hope they are the newest ones.
The problem is that the corresponding Exmaralda file does not contain any visual concepts. TV2Lucene indexes both transcripts and visual cues and by default expects to have both.
Question: Is this something particular to this video, or are there many others where LSCOM concepts are missing as well?
thanks for investigating - actually I don't know, could check; however, Lampis is currently working on a new algorithm for TKK videos anyway, so I suggest we discuss this in our WP5 telecon later today
now I tried
java -jar TV2Lucene.jar -provider SV -exm /mnt/data/SV/final/exmaralda/TUSSEN_KUNST_-AVR00006Z6J_115000_2850800.exb -srt /mnt/data/SV/final/subs/TUSSEN_KUNST_-AVR00006Z6J_115000_2850800.srt -videoId 48ee6e56-4747-465b-9ef4-e219720aa113
where ftp://ftp.condat.de/archive/data/SV/final/exmaralda/TUSSENKUNST-AVR00006Z6J_115000_2850800.exb seems to me to include visual concepts, but I get the same NullPointerException. Am I overlooking something?
This new exmaralda file looks exactly the same... no visual concepts. Are you sure we talk about the same document? In more detail:
I got the file from here: (not exactly the same URL you specified, I needed to capitalize d in "data" but I guess this is not relevant) ftp://ftp.condat.de/archive/Data/SV/final/exmaralda/TUSSENKUNST-AVR00006Z6J_115000_2850800.exb
If you take a look at it, there is no "CERTH_Concept-1_all" layer, which is the one TV2Lucene relies on.
ah, ok thanks, I see the problem, sorry!! will check again.
OK, I've checked again with file 'TUSSENKUNST-AVR00008KC2_115000_2839440.exb', which contains both chapters and visual concepts, and indeed it works! Thanks again, so I will index all TKK files which contain both visual concepts and chapters (currently about 12, as far as I can see); the others will follow as soon as available.
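A minimal sketch for pre-filtering EXB files before calling TV2Lucene, so that files without visual concepts are skipped instead of running into the NPE. It assumes the layer name "CERTH_Concept-1_all" appears literally somewhere in the file and does a plain text search rather than parsing the Exmaralda XML; the function names are hypothetical, and a chapter layer name could be passed in the same way once it is known.

```python
from pathlib import Path
from typing import Iterable, List

# Layer name mentioned above as the one TV2Lucene relies on.
CONCEPT_LAYER = "CERTH_Concept-1_all"


def has_layer(exb_text: str, layer: str = CONCEPT_LAYER) -> bool:
    """True if the layer name occurs literally in the EXB content."""
    return layer in exb_text


def indexable_exb_files(paths: Iterable[str], layer: str = CONCEPT_LAYER) -> List[str]:
    """Keep only the EXB files whose content mentions the given layer."""
    keep = []
    for p in paths:
        text = Path(p).read_text(encoding="utf-8", errors="replace")
        if has_layer(text, layer):
            keep.append(p)
    return keep
```

A substring check is deliberately crude; it only answers "is the layer mentioned at all", which matches the manual check done here.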
ok, obviously we have to delete the existing indexes and then reindex. I tried different ways like
curl http://localhost:8983/solr/#/RBBIndex/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
and in the browser, however this doesn't work. Could you please tell me the right way to delete an index here? Thanks!
did investigate a little bit more and came up with the following (the above statement cannot work, because we're using multiple cores here):
http://localhost:8983/solr/admin/cores?action=RELOAD&core=RBBIndex&deleteIndex=true
but I am not sure and don't want to try this and mess up the index; could you confirm this or correct it, pls? Thanks!!
OK, found also another possibility: stop the Solr Server, delete the files in the data directory and redeploy the SolrServer; will try this
There is a REST API call you can try against the server and should do the job, take a look at: http://stackoverflow.com/questions/7722508/how-to-delete-all-data-from-solr-and-hbase
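For reference, a minimal sketch of such a call for a multi-core setup. It assumes the core is named RBBIndex as in the attempts above and that its update handler is reachable at /solr/RBBIndex/update, which is the usual layout but has not been verified on this server. Note that the core path carries no `#/`: that fragment belongs to the admin UI's browser URL and is not part of the request path. The helper name is hypothetical, and the request is only built here, not sent.

```python
import urllib.request


def build_delete_all_request(
    core: str, base_url: str = "http://localhost:8983/solr"
) -> urllib.request.Request:
    """Build a POST that deletes every document in one core and commits."""
    url = f"{base_url}/{core}/update?commit=true"
    body = b"<delete><query>*:*</query></delete>"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_delete_all_request("RBBIndex")
    print(req.get_method(), req.full_url)
    # Destructive: uncomment only when the index should really be emptied.
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status)
```

Unlike removing the data folders by hand, this keeps the server running, but it still requires a full reindex afterwards.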
ok, thanks! but think I will now just delete the /data folders, redeploy/restart and reindex
update the different Solr indexes whenever a new srt, exb, ttl file has been created.