lintool / twitter-tools

Twitter Tools
twittertools.cc
MalformedJsonException forced end to indexing #34

Closed: isoboroff closed this issue 11 years ago

isoboroff commented 11 years ago

Was running the indexer at HEAD in trec2013-api over the weekend on my version of the 2013 crawl. I hit the odd exception below.

13/05/11 04:24:47 INFO indexing.IndexStatuses: 173700000 statuses indexed
com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
    at com.google.gson.Streams.parse(Streams.java:51)
    at com.google.gson.JsonParser.parse(JsonParser.java:83)
    at com.google.gson.JsonParser.parse(JsonParser.java:58)
    at com.google.gson.JsonParser.parse(JsonParser.java:44)
    at cc.twittertools.corpus.data.Status.fromJson(Status.java:112)
    at cc.twittertools.corpus.data.JsonStatusBlockReader.next(JsonStatusBlockReader.java:44)
    at cc.twittertools.corpus.data.JsonStatusCorpusReader.next(JsonStatusCorpusReader.java:48)
    at cc.twittertools.search.indexing.IndexStatuses.main(IndexStatuses.java:138)
Caused by: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
    at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1110)
    at com.google.gson.stream.JsonReader.decodeLiteral(JsonReader.java:1100)
    at com.google.gson.stream.JsonReader.peek(JsonReader.java:343)
    at com.google.gson.Streams.parse(Streams.java:38)
    ... 7 more
com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
    at com.google.gson.Streams.parse(Streams.java:51)
    at com.google.gson.JsonParser.parse(JsonParser.java:83)
    at com.google.gson.JsonParser.parse(JsonParser.java:58)
    at com.google.gson.JsonParser.parse(JsonParser.java:44)
    at cc.twittertools.corpus.data.Status.fromJson(Status.java:112)
    at cc.twittertools.corpus.data.JsonStatusBlockReader.next(JsonStatusBlockReader.java:44)
    at cc.twittertools.corpus.data.JsonStatusCorpusReader.next(JsonStatusCorpusReader.java:48)
    at cc.twittertools.search.indexing.IndexStatuses.main(IndexStatuses.java:138)
Caused by: com.google.gson.stream.MalformedJsonException: invalid number or unquoted string near Stream closed.
    at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1110)
    at com.google.gson.stream.JsonReader.decodeLiteral(JsonReader.java:1100)
    at com.google.gson.stream.JsonReader.peek(JsonReader.java:343)

I haven't tracked this further to find the bundle it barfed on. I think this might be happening if we have a malformed tweet at the end of a block, but I don't see why that should happen. It left me an index after the crash, so I'll see if I can't make a test case.
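If the cause does turn out to be a bad record at the end of a block, one possible mitigation is to skip records that fail to parse instead of letting the exception end the run. The following is only a sketch of that idea, not the project's code: `SkipMalformed` and `parseAll` are hypothetical names, the parser is passed in as a plain function (in twitter-tools it would presumably wrap `Status.fromJson`), and it assumes Java 8 or later for `java.util.function.Function`. Gson's `JsonSyntaxException` is an unchecked exception, so catching `RuntimeException` would cover it. Note that this would not address a stream that was closed for some other reason.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: parse every line, set aside the ones that fail.
final class SkipMalformed {

    /**
     * Applies the parser to each line. Lines whose parse throws a
     * RuntimeException are appended to 'rejected' and skipped; null
     * results are dropped. Returns the successfully parsed values.
     */
    static <T> List<T> parseAll(Iterable<String> lines,
                                Function<String, T> parser,
                                List<String> rejected) {
        List<T> parsed = new ArrayList<>();
        for (String line : lines) {
            try {
                T value = parser.apply(line);
                if (value != null) {
                    parsed.add(value);
                }
            } catch (RuntimeException e) {
                // Keep the offending line so the bad bundle can be found later.
                rejected.add(line);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Stand-in parser for the demo: Integer::parseInt throws
        // NumberFormatException (unchecked) on bad input.
        List<String> rejected = new ArrayList<>();
        List<Integer> ok = parseAll(Arrays.asList("1", "2", "x3"),
                                    Integer::parseInt, rejected);
        System.out.println("parsed=" + ok + " rejected=" + rejected);
    }
}
```

Collecting the rejected lines rather than only counting them is a deliberate choice here: it would leave a record of which input triggered the failure, which is the piece still missing in this report.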

milesefron commented 11 years ago

Ian, Let me know if you want me to check this out. Otherwise I'll wait to hear if you find anything regarding a test example.

isoboroff commented 11 years ago

I may not get back to it until late this week or early next, so if you can and want to poke, pls go for it.

isoboroff commented 11 years ago

Managed to slide the partial index into a Solr instance (since Luke is AWOL as of two minor Lucene versions ago ;-( ). Here is the last document indexed:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1662</int>
  </lst>
  <lst name="index">
    <int name="numDocs">178971335</int>
    <int name="maxDoc">178971335</int>
    <int name="deletedDocs">0</int>
    <long name="version">19115</long>
    <int name="segmentCount">27</int>
    <bool name="current">true</bool>
    <bool name="hasDeletions">false</bool>
    <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(org.apache.lucene.store.NIOFSDirectory@/Volumes/Data/solr-4.2.1/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@7e140bf; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
    <lst name="userData"/>
  </lst>
  <lst name="doc">
    <int name="docId">178971334</int>
    <lst name="lucene">
      <lst name="id">
        <str name="type">string</str>
        <str name="schema">I-S-----OF-----l</str>
        <str name="flags">-TS-------------</str>
        <str name="value">311588095554371584</str>
        <str name="internal">311588095554371584</str>
        <float name="boost">1.0</float>
        <int name="docFreq">0</int>
      </lst>
      <lst name="epoch">
        <null name="type"/>
        <str name="schema">----------------</str>
        <str name="flags">-TS-------------</str>
        <null name="value"/>
        <str name="internal">1363123365</str>
        <float name="boost">1.0</float>
        <int name="docFreq">0</int>
      </lst>
      <lst name="screen_name">
        <null name="type"/>
        <str name="schema">----------------</str>
        <str name="flags">ITS-------------</str>
        <null name="value"/>
        <str name="internal">Hanoudie</str>
        <float name="boost">1.0</float>
        <int name="docFreq">0</int>
      </lst>
      <lst name="text">
        <str name="type">text_general</str>
        <str name="schema">IT--M-----------</str>
        <str name="flags">ITS-------------</str>
        <str name="value">روحي مَ هي ناقصهه يومك تعنيهاا</str>
        <str name="internal">روحي مَ هي ناقصهه يومك تعنيهاا</str>
        <float name="boost">1.0</float>
        <int name="docFreq">0</int>
      </lst>
      <lst name="retweet_count">
        <null name="type"/>
        <str name="schema">----------------</str>
        <str name="flags">-TS-------------</str>
        <null name="value"/>
        <str name="internal">0</str>
        <float name="boost">1.0</float>
        <int name="docFreq">0</int>
      </lst>
      <lst name="friends_count">
        <null name="type"/>
        <str name="schema">----------------</str>
        <str name="flags">-TS-------------</str>
        <null name="value"/>
        <str name="internal">326</str>
        <float name="boost">1.0</float>
        <int name="docFreq">0</int>
      </lst>
      <lst name="followers_count">
        <null name="type"/>
        <str name="schema">----------------</str>
        <str name="flags">-TS-------------</str>
        <null name="value"/>
        <str name="internal">125</str>
        <float name="boost">1.0</float>
        <int name="docFreq">0</int>
      </lst>
      <lst name="statuses_count">
        <null name="type"/>
        <str name="schema">----------------</str>
        <str name="flags">-TS-------------</str>
        <null name="value"/>
        <str name="internal">9795</str>
        <float name="boost">1.0</float>
        <int name="docFreq">0</int>
      </lst>
    </lst>
    <doc name="solr">
      <str name="id">311588095554371584</str>
      <str name="epoch">1363123365</str>
      <str name="screen_name">Hanoudie</str>
      <arr name="text">
        <str>روحي مَ هي ناقصهه يومك تعنيهاا</str>
      </arr>
      <str name="retweet_count">0</str>
      <str name="friends_count">326</str>
      <str name="followers_count">125</str>
      <str name="statuses_count">9795</str>
    </doc>
  </lst>
  <lst name="info">
    <lst name="key">
      <str name="I">Indexed</str>
      <str name="T">Tokenized</str>
      <str name="S">Stored</str>
      <str name="D">DocValues</str>
      <str name="M">Multivalued</str>
      <str name="V">TermVector Stored</str>
      <str name="o">Store Offset With TermVector</str>
      <str name="p">Store Position With TermVector</str>
      <str name="O">Omit Norms</str>
      <str name="F">Omit Term Frequencies &amp; Positions</str>
      <str name="P">Omit Positions</str>
      <str name="H">Store Offsets with Positions</str>
      <str name="L">Lazy</str>
      <str name="B">Binary</str>
      <str name="f">Sort Missing First</str>
      <str name="l">Sort Missing Last</str>
    </lst>
    <str name="NOTE">Document Frequency (df) is not updated when a document is marked for deletion.  df values include deleted documents.</str>
  </lst>
</response>

isoboroff commented 11 years ago

I can't reproduce this one anymore. Closing.