simon987 / sist2

Lightning-fast file system indexer and search tool
GNU General Public License v3.0

Fails while scanning #18

Closed dpieski closed 4 years ago

dpieski commented 4 years ago

sist2: ../../../mce/helper.c:67: mceQNameLevelCleanup: Assertion `0==level || qname_level_set->max_level<level' failed.

This occurred after a couple hundred warning lines similar to the ones in the attached screenshot.

simon987 commented 4 years ago

I'm uploading version 1.1.14 (Docker only); can you let me know if you still get this issue with that version? @dpieski

dpieski commented 4 years ago

It seems to be working now. I am scanning a large directory - it didn't fail due to this error today. (lack of memory again in my docker - increasing to 12GB now.)

simon987 commented 4 years ago

 lack of memory again in my docker - increasing to 12GB now

That's very unusual; it shouldn't use more than a few megabytes per thread (maybe a GB each at the very extreme). I'll investigate to see where I introduced memory leaks...

simon987 commented 4 years ago

@dpieski I pushed another release; can you let me know if you still encounter those memory problems? Thank you!

dpieski commented 4 years ago

I just got this error now: sist2: ../../../opc/zip.c:1061: opcZipGC: Assertion `header_size<=segment->header_size' failed.

I did not get that before on this same set of files.

simon987 commented 4 years ago

It never ends! Ok, thanks a lot for bearing with me and sorry for the back and forth, this library is not exactly easy to deal with.

I'll try to find a fix later this week :sweat_smile:

dpieski commented 4 years ago

No problem! I completely understand. I really appreciate your work on this. Ultimately, I am hoping to use this at my office to search through a few TB of client files. Since the client files have been accumulating for more than a decade, the folder does give a very wide range of documents to test on...

simon987 commented 4 years ago

@dpieski Please try again with v1.2.3+, thank you!

dpieski commented 4 years ago

Hey @simon987 ! I tried it out. It got a lot further this time. Now I am getting:

Entity: line 2: parser error : attributes construct error
v></c><c r="F20920" s="10" t="s"><v>107710773</v></c></row><row r="2096Descent="
                                                                               ^
Entity: line 2: parser error : Couldn't find end of Start Tag row
v></c><c r="F20920" s="10" t="s"><v>107710773</v></c></row><row r="2096Descent="

simon987 commented 4 years ago

Does the scan step crash right after those errors? Do you mind running it again with --very-verbose and emailing me the file in question, @dpieski? me@simon987.net
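
Off the top of my head, a verbose run with the output captured to a file would look roughly like this (the directory path is just a placeholder; double-check sist2 scan --help for your version):

    sist2 scan --very-verbose /path/to/files > scan.log 2>&1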

simon987 commented 4 years ago

I was able to reproduce that same error code by deliberately breaking the XML contents of a .docx file.

In v1.2.4, I made it so those errors will not be displayed (unless --very-verbose is enabled). At this point I can't really do anything else about it (if a file is broken, the best I can do is to ignore it). @dpieski please let me know if you still see those errors using the latest version, thanks!

dpieski commented 4 years ago

Can you capture that error or something? The scan step stops right after that error still. It is the same error as last time. Docker exits with code 139.

simon987 commented 4 years ago

The scan step stops right after that error still.

Oh, sorry. I had assumed that it didn't crash.

I'll try to replicate the crash on my side...

simon987 commented 4 years ago

Hi @dpieski,

The newest version, v1.2.7, should be able to gracefully handle XML errors in documents. I had originally assumed that the library I was using would take care of that for me. Please let me know if you still have this issue with that version (and sorry for the long delays :/)

*Note: you will need to re-index with the --force-reset option when upgrading to 1.2.6+ because of a breaking change in the mappings
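
Concretely, the re-index after upgrading should look roughly like this (the index path is a placeholder, and I'm assuming --force-reset is passed to the index step; double-check with sist2 index --help):

    sist2 index --force-reset /path/to/index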

dpieski commented 4 years ago

@simon987 woot! It is working! Took about 44hrs to scan 2.5mil files totaling 1.7TB. Index size of about 5.9GB. All the errors were caught without breaking sist2. Downside - apparently I slowed down access to the drive for everyone - whoops.

Question though - is there a location I can attach a docker volume to get a log of the output?

Also, this is kind of off topic for this post (I can make a new issue to discuss it if you would rather), but with the incremental flag, does it pick up modified files or only files that are new? Is there a way to run again to create thumbnails and add the thumbnails to the db? Or can I run again and just tesseract the non-OCR pdfs, or try tesseract on any pdf with no text?

simon987 commented 4 years ago

Took about 44hrs to scan 2.5mil files totaling 1.7TB. Index size of about 5.9GB

Good to hear! As a side note, the default settings only index the first 16kb of text of every document; you can change the size with --content-size <bytes>. This might significantly increase the index size.
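
For example, raising the limit to 64kb would look roughly like this (the path is just a placeholder):

    sist2 scan --content-size 65536 /path/to/files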

is there a location I can attach a docker volume to get a log of the output?

No, you'd need to either "pipe" the logs to a file yourself (e.g. sist2 scan ~ > logs.txt) or run the container without --rm and use the docker logs command.
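
For instance, either of these should work (the container name, image name and mount paths are placeholders based on the usual Docker workflow; adjust them to your setup):

    # redirect the scan output to a file yourself
    docker run --rm -v /my/files:/files simon987/sist2 scan /files > scan.log 2>&1

    # or keep the container around and read its logs afterwards
    docker run --name sist2-scan -v /my/files:/files simon987/sist2 scan /files
    docker logs sist2-scan > scan.log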

with the incremental flag, does it pick up modified files or only files that are new

The incremental flag will only index files with a different mtime (modification time). This behavior is untested on NTFS, but I think it should work fine.
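
Roughly, the pattern is something like this (paths are placeholders, and I'm assuming --incremental takes the path of the previous index and -o the new output location; double-check with sist2 scan --help for your version):

    sist2 scan --incremental /path/to/old_index -o /path/to/new_index /path/to/files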

Is there a way to run again to create thumbnails and add the thumbnails to the db?

I'm not sure what you mean by that. By default thumbnails are created and stored for every supported file type.

can I run again and just tesseract the non-OCR pdfs, or try tesseract on any pdf with no text?

If you're talking about the --ocr option, it will attempt to OCR every single image in every .pdf file (with 2.5mil files this could take months).
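
For reference, an OCR-enabled scan would look roughly like this (the path is a placeholder, and I'm assuming --ocr takes a Tesseract language name such as eng; double-check with sist2 scan --help):

    sist2 scan --ocr eng /path/to/files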

I'm closing this issue now but feel free to keep discussing here

dpieski commented 4 years ago

As a side note, the default settings only index the first 16kb of text of every document

For a docx file, does that mean the first 16k of the docx or the first 16k of the docx>document.xml file? Both? Wondering because there seem to be fewer docx files in search results than I would have anticipated. Did you recently change the default? In main.c it says the default is 32k, btw. (I did not look to see what is actually implemented.)

Re: thumbnails - I did not create the thumbnails the first time through because of size issues. Does it keep the thumbnails in RAM as it scans? Essentially, I was wondering if I could run it again just to generate thumbnails for files already scanned.

Re: OCR - I have about 850k pdf files. If a pdf is only images (e.g. no selectable text), is that noted somewhere? If so, is there a way I could run OCR on that subgroup of pdfs? That way scanning most of the files is done more quickly, and the OCR can just chug along at its own pace separately.

(I have no idea how your OCR stuff works, but random thoughts follow.) Since the content-size limits the depth into any file, could the OCR just work on one page (or some set number of pages) at a time until you get to the content size? Would that speed it up? Or would the size checking slow it down?

simon987 commented 4 years ago

For a docx file, does that mean the first 16k of the docx or first 16k of the docx>document.xml file?

The first 16k of indexed characters (meaning that special characters, whitespace and any other information in the document do not count towards this).

Did you recently change the default? In the main.c it says the default is 32k btw. (I did not look to see what is actually implemented.)

Yes, sorry, it's 32k, not 16k.

Re: thumbnails - I did not create the thumbnails the first time through because of size issues. Does it keep the thumbnails in RAM as it scans?

Generating thumbnails should not use much more memory. The thumbnails are saved to disk immediately and are not kept in RAM.

Essentially, I was wondering if I could run it again just to generate thumbnails for files already scanned.

Unfortunately, there is no way to do that.

Re: OCR - I have about 850k pdf files. If a pdf is only images (eg no selectable text) is that noted somewhere? If so, is there a way I could run OCR on that subgroup of pdfs?

Not currently, no. The app has no way to differentiate between a scanned and a searchable document right now. I'll try to improve the OCR function, but right now, OCRing 850k PDFs would take a decade on consumer hardware.

Since the content-size limits the depth into any file, could the OCR just work on 1pg (or some set number of pages) at a time until you get to the content size? Would that speed it up? Or would the size checking slow it down?

I'm not sure I follow what you're saying. Right now, when OCR is enabled, the PDF document is read and every image in it is sent one by one to the tesseract-ocr library. As soon as the maximum number of characters is reached, the document is closed and it stops reading.

dpieski commented 4 years ago

Awesome! Thanks so much for your response. I think you have answered all my questions now.

If you would like some assistance with documentation, I wouldn't mind helping out to give back a little for all your help.