tools4j / stacked-off

Offline StackExchange indexer and search engine
MIT License

Hello #1

Closed idbxy closed 5 years ago

idbxy commented 5 years ago

I get an error after parsing 8 GB of the 18 GB Comments.xml.

https://i.imgur.com/9AdCzm4.png

What's the cause of this? Could you answer quickly? I need it within 2 days, before I leave.

benjwarner commented 5 years ago

Hi there,

I haven't seen that error before.

Unfortunately I'm camping at the moment; the earliest I can look at it is Tuesday.

The only thing I can think of trying is to download the stack dump released before the one you were using. (I realize that's quite a hassle.)

Cheers, Ben

(Email reply to idbxy's comment of Fri, 23 Aug 2019, 12:43.)

idbxy commented 5 years ago

So I downloaded the previous year's stack dump and had the same issue again:

    at org.tools4j.stacked.index.FileInZipParser.start(SeZipFileParser.kt:189)
    at org.tools4j.stacked.index.ExtractCallback$getStream$2.run(SeZipFileParser.kt:104)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.tools4j.stacked.index.XmlFileParserException: Write end dead child number [52986815]
    at org.tools4j.stacked.index.XmlFileParser.parseElements(XmlFileParser.kt:64)
    at org.tools4j.stacked.index.XmlFileParser.parse(XmlFileParser.kt:20)
    at org.tools4j.stacked.index.FileInZipParser.start(SeZipFileParser.kt:187)
    ... 6 more
Caused by: com.ctc.wstx.exc.WstxIOException: Write end dead
    at com.ctc.wstx.sr.StreamScanner.constructFromIOE(StreamScanner.java:640)
    at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1004)
    at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1043)
    at com.ctc.wstx.sr.StreamScanner.getNextChar(StreamScanner.java:789)
    at com.ctc.wstx.sr.BasicStreamReader.parseAttrValue(BasicStreamReader.java:1973)
    at com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3145)
    at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:3043)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2919)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1123)
    at org.codehaus.stax2.ri.Stax2EventReaderImpl.nextEvent(Stax2EventReaderImpl.java:255)
    at org.tools4j.stacked.index.XmlFileParser.parseElements(XmlFileParser.kt:37)
    ... 8 more
Caused by: java.io.IOException: Write end dead
    at java.io.PipedInputStream.read(Unknown Source)
    at java.io.PipedInputStream.read(Unknown Source)
    at com.ctc.wstx.io.BaseReader.readBytes(BaseReader.java:155)
    at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:369)
    at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:112)
    at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:89)
    at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
    at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:998)
    ... 17 more
21:31:00.380 [Thread-15] ERROR org.tools4j.stacked.index.SeDirParser - Write end dead child number [52986815] in file [Comments.xml] whilst parsing archive [D:\StackOverflow\Website\stackoverflow.com-Comments.7z]
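
For context, the innermost cause in the trace above comes from `java.io.PipedInputStream.read`. The JDK raises `Write end dead` when a reader finds the pipe empty and the thread that last wrote to it has exited without closing its end. The sketch below reproduces that general behaviour in isolation. It is not stacked-off's code: the class and method names are hypothetical, and it says nothing about why the writing side stopped in this particular run.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

// Hypothetical demo class, not part of stacked-off.
class WriteEndDeadDemo {

    /**
     * Runs a writer thread that sends three bytes and then exits.
     * If closeWriter is false, the writer ends without closing the pipe.
     * Returns "EOF" on a clean end of stream, or the IOException message.
     */
    static String readAfterWriterExits(boolean closeWriter) {
        try {
            PipedInputStream in = new PipedInputStream();
            PipedOutputStream out = new PipedOutputStream(in);

            Thread writer = new Thread(() -> {
                try {
                    out.write(new byte[] {1, 2, 3});
                    if (closeWriter) {
                        out.close(); // signals end of stream to the reader
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            writer.start();
            writer.join(); // the writer thread is no longer alive past this point

            // Drain the three buffered bytes; these reads succeed either way.
            for (int i = 0; i < 3; i++) {
                in.read();
            }
            // The buffer is now empty, so this read depends on the writer's state.
            try {
                return in.read() == -1 ? "EOF" : "unexpected data";
            } catch (IOException e) {
                return e.getMessage();
            }
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("writer closed the pipe: " + readAfterWriterExits(true));
        System.out.println("writer exited silently: " + readAfterWriterExits(false));
    }
}
```

When the writer closes its end, the reader sees a normal end of stream. When the writer thread simply ends, the reader gets the same `Write end dead` message that appears at the bottom of the trace, which suggests looking at whatever was feeding the pipe rather than at the XML parser itself.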

idbxy commented 5 years ago

Is this quickly solvable on my end by Monday? I would REALLY appreciate it if you could look into it; it's needed ASAP and is very important.

benjwarner commented 5 years ago

No, it won't be fixed by Monday.

(Email reply to idbxy's comment of Sat, 24 Aug 2019, 20:50.)

idbxy commented 5 years ago

Is there anything else I could try? Or do you know of any other ways to use Stack Overflow offline?

benjwarner commented 5 years ago

Not that I can think of, without looking at it further. I'm camping right now, so I have no computer access.

You could try using the other stack dump tool that is available.

(Email reply to idbxy's comment of Sat, 24 Aug 2019, 21:06.)

idbxy commented 5 years ago

Hello,

I moved the stacked-off folder to the C drive (primary drive) instead of the D drive, and moved the website folder (the rar, 7z data dumps) inside the stacked-off folder where the bin and lib folders are.

I had the idea after looking at some similar projects that said to store something on the primary drive instead of other drives, and I thought it was worth a shot.

Previously it was:
D:\Foldername\Website
D:\Foldername\StackedOff\Bin+Lib

Now it's:
C:\StackedOff\Bin+Lib
C:\StackedOff\Website

So far this has resolved the issue: I no longer get the dead child error, and I'm almost done indexing Stack Overflow.

If I don't respond again, it means this solved the issue and I wanted you to know how.

So it might be an idea to change this line in the README:

Download the latest zip version from here, and unzip into your desired location.

into

Download the latest zip version from here, and unzip into your desired location on your primary drive (usually the C drive)

idbxy commented 5 years ago

OK, to confirm: it's fixed that way.

Is there any way to search by tag? For example, the c++ tag only.

thanks!

benjwarner commented 5 years ago

Ok, thanks for letting me know.

There's no way to search specifically by tag, but tags are included when you do a search. So if you search for c++, the results should include questions that have the c++ tag.

If for some reason the c++ search doesn't look right, try using double quotes, e.g. "c++".

(Email reply to idbxy's comment of Sun, 25 Aug 2019, 15:56.)