opensemanticsearch / open-semantic-search

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
https://opensemanticsearch.org
GNU General Public License v3.0
971 stars 169 forks

"Import Status: Running file import" stuck. #282

Open ZeroCool940711 opened 4 years ago

ZeroCool940711 commented 4 years ago

OpenSemanticSearch seems to be stuck extracting and analyzing some files. It has been more than a few days and it still shows the same message when searching; even after rebooting, it is still stuck on the same files. It doesn't seem to be indexing anything new, as the total document count is the same as before, and OSS doesn't appear to be doing anything else.

wAikAp commented 4 years ago

Same here. I waited a long time but it does not seem to be working, and Flower shows no active session.

ZeroCool940711 commented 4 years ago

I think after some time it just stops working. In my case, after 75 billion documents it doesn't index or process anything, even though the CPU and RAM on my server are not being used at all. It seems like there is some internal limit or something is broken; nothing is logged, so it's hard to tell what's going on.

wAikAp commented 4 years ago

But I am indexing just 6 files. It seems 1 .ppt file can't complete the OCR task; I have waited for 2 days and the import status is still "Running file import (still 1 documents to process)".

srich commented 4 years ago

I am also experiencing this issue testing out open-semantic-search 20.02.08. Is there a service that needs to be restarted, or how is this issue resolved?

Adding start/stop instructions for services in addition to "solr" would be helpful, as well as the order of operations: https://www.opensemanticsearch.org/doc/admin/cmd

DetlevCM commented 4 years ago

Is it slow or is it stuck? - I set up a new instance on a laptop (with really too little RAM, so there will be a lot of swapping), and it seemed stuck on 3 files. After maybe 2 days it was down to 2 files. So not stuck, but slow due to swapping...
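One way to tell "slow" from "stuck" is to note the pending count from the import status message at two points in time and look at the rate. The following is only a minimal sketch: the function name, the threshold and the idea of collecting the samples by hand are assumptions, not anything Open Semantic Search provides.

```python
from typing import List, Tuple


def classify_progress(samples: List[Tuple[float, int]],
                      min_rate_per_hour: float = 0.01) -> str:
    """Classify an import as 'done', 'slow', 'stuck' or 'unknown'.

    samples: (unix_timestamp_seconds, pending_documents) pairs, oldest first.
    min_rate_per_hour: hypothetical threshold below which we call it stuck.
    """
    if len(samples) < 2:
        return "unknown"  # one reading cannot show a trend
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if p1 == 0:
        return "done"
    hours = (t1 - t0) / 3600.0
    if hours <= 0:
        return "unknown"
    rate = (p0 - p1) / hours  # documents completed per hour
    return "slow" if rate >= min_rate_per_hour else "stuck"
```

With the numbers above (3 files pending, then 2 files about 2 days later) this reports "slow"; a count that does not move at all over the same period is reported as "stuck".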

Though I will also say that the User Interface is not ideal as it would be nice to know which files are missing...
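Until the interface shows this, a rough workaround could be to compare the folder on disk with a list of paths that are already indexed. This is a sketch under a loud assumption: `indexed_paths` is a hypothetical list you would have to export yourself (for example from search results); the code below does not talk to Open Semantic Search or Solr at all.

```python
import os
from typing import Iterable, List


def files_not_indexed(root: str, indexed_paths: Iterable[str]) -> List[str]:
    """Return files under root that are absent from indexed_paths.

    indexed_paths is assumed to come from some export of the index;
    how to obtain it is outside the scope of this sketch.
    """
    indexed = {os.path.abspath(p) for p in indexed_paths}
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.abspath(os.path.join(dirpath, name))
            if path not in indexed:
                missing.append(path)
    return sorted(missing)
```

The result is just a sorted list of absolute paths, which at least narrows down which documents to test one at a time.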

ZeroCool940711 commented 4 years ago

In my case it's completely stuck. It's not a RAM problem, as the server I'm running it on has plenty of RAM. I think it might have something to do with images being deleted before it can process them. If I'm right, images are not downloaded to the server but are instead used directly from the website where they were indexed, so it could be that an image was deleted or moved before it could be processed, or that it doesn't have access to the image. It could be trying the same files over and over, and because they are not accessible the process cannot be completed.
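To test that guess, one could check whether the source URLs of the pending images still respond. A minimal sketch using only the Python standard library; the function name is made up here, and whether OSS really fetches images this way is an assumption, not something confirmed in this thread.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a HEAD request to url succeeds within timeout_s."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors (404, 403), refused connections, timeouts and
        # malformed URLs all count as "not reachable" in this sketch.
        return False
```

Running this over the URLs of the documents that never finish would show whether they have been deleted, moved or made inaccessible.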

olli0815 commented 4 years ago

I have the same problem: indexing a small folder via "opensemanticsearch-index-dir" leads to the message "Running file import (still 77 documents to process)". The CLI shows "Indexing new file: ...." but index creation seems to be stuck. The folder contains only simple text files without any images.

Any hints for finding the root cause? Logs?

Edit: indexing a single file from the same folder with opensemanticsearch-index-file runs fine.

mbanks850 commented 4 years ago

Mine is similar. It has looked this way since February, and there have been a bunch of reboots and crashes. I am running the 20.01.17 release. I was thinking of downloading 20.04.17 to see if it made any difference.

It would be nice if there was a timeout: have it skip the current document and move on, then come back to it on the next pass.
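The timeout-and-skip idea can be sketched with the standard library alone. This is not how Open Semantic Search works internally; `run_pass` and `stand_in_command` are hypothetical names, and the stand-in command only simulates a document that hangs.

```python
import subprocess
import sys
from typing import Callable, List, Sequence, Tuple


def run_pass(paths: Sequence[str],
             build_command: Callable[[str], List[str]],
             timeout_s: float) -> Tuple[List[str], List[str]]:
    """Process each path once; return (done, deferred_for_next_pass)."""
    done, deferred = [], []
    for path in paths:
        try:
            # subprocess.run kills the child process when the timeout expires.
            subprocess.run(build_command(path), check=True, timeout=timeout_s,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
            deferred.append(path)  # skip it now, retry on the next pass
        else:
            done.append(path)
    return done, deferred


def stand_in_command(path: str) -> List[str]:
    """Hypothetical stand-in for a real extraction command.

    It hangs for 30 seconds on paths containing "slow", else exits at once.
    """
    delay = 30 if "slow" in path else 0
    return [sys.executable, "-c", f"import time; time.sleep({delay})"]
```

Calling `run_pass(["a.txt", "slow.ppt", "b.txt"], stand_in_command, timeout_s=3.0)` finishes the two quick files and defers the slow one instead of blocking the whole queue. In practice `build_command` would have to return whatever per-file command you use for indexing.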

Import status: Running file import (still 5071601 documents to process)

Because of yet running and open tasks like text extraction and analysis maybe not all results were found yet, since at the moment of this search 5071601 file(s) could be only searched, overviewed and filtered by their file names only, not yet by their content and/or content based facets/filters!

Previous Newest 10 of 5339085 documents

DetlevCM commented 4 years ago

If anybody wants to do some testing, I wonder if the problem does not stem from an interaction of components (it might be too early to tell just now on my end):

I decided to "clean up" and start with a new freshly configured instance of OpenSemantic Search. (Side note: after updating Debian, I immediately had some corrupted files in /var/lib/dpkg/info ... - I wonder why and how.)

In order to reduce the computational cost, and also because I am not sure it adds value in my specific use case, I disabled both the Named Entity Recognition (spaCy) and the Graph DB (Neo4j). So far it seems that the import is running fast without any problems. At present it is OCRing the documents. In addition, significantly fewer files are written to /tmp (I had something daft like 200.000 files or so before...). So far I see about 500: the pages from the document.
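For anyone who wants to compare the temp file counts on their own instance, here is a small sketch that counts files under a directory, grouped by top-level subdirectory. The function name is made up; nothing here is specific to Open Semantic Search.

```python
import os
from collections import Counter


def count_files_by_top_level(root: str) -> Counter:
    """Count files under root, grouped by first-level subdirectory.

    Files directly inside root are counted under the key ".".
    Unreadable directories are silently skipped by os.walk.
    """
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        counts[top] += len(filenames)
    return counts
```

Something like `count_files_by_top_level("/tmp").most_common(5)` run before and after a change in configuration would show whether the number of temporary files really drops.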

I guess I will see in "a while" (whenever...) if this helps. Incidentally, my previous installation of OpenSemantic Search never calmed down and seemed to continue working indefinitely... (I am using it as a local search engine for my document library. I don't need more than a search engine, so all the machine learning and the semantics support are not important to me.)

mbanks850 commented 4 years ago

What steps did you use to disable Named Entity Recognition and Graph DB? Do you know what we would lose by disabling those features?

DetlevCM commented 4 years ago

@mbanks850 I only use the interface that is exposed to the user. -> Open Semantic Search interface -> config in the top menu -> "Named Entity Recognition" and "Graph DB (Neo4j)" options.

Both entries have some descriptions. The graph database deals with relationships between documents and the named entity recognition tries to understand the document based on machine learning principles.

Given that the project (based on the description) was developed to deal with data dumps for journalists, such tools may be very, very useful. When using it as a local document search engine, the relations (graph database) become less interesting. The Named Entity Recognition could be useful, but is possibly not well tuned for, say, technical documents. It may also be that the Named Entity Recognition deals with the semantic aspects of the search, so turning it off may make OpenSemantic Search "dumber". Given that I want to search a database of papers and technical documents that I create, this seems fair enough to me.

Now for some reason, this has led to Open Semantic Search not showing me how many files it still wants to OCR... But Tesseract OCR is the only process hogging the CPU. (I don't think Solr is particularly heavy for the straightforward searching. It is the part that tries to be clever which is CPU-intensive.)

mbanks850 commented 4 years ago

Thank you, looking at the descriptions, Graph DB is not something I will need. Named Entity maybe, but we are also just using it as a search in technical documents.

RiteshSingh commented 4 years ago

Same issue in OSS 20.04.17 and 20.01.17

rusty9283 commented 3 years ago

Same issue here.

Ubuntu 20.04, OSS 20.11.01 and 21.01.03

Indexing via opensemanticsearch-index-dir -> ~210.000 files. After about 16 hours, 2 documents have been extracted, but the CPU is at 100% with 8 tasks from "etl_tasks".

"NER" and "Neo4j" are disabled.

I tried resetting file monitoring and deleting the index, but the CPU is always at 100% with "etl_tasks" without anything being indexed.

Only if I stop the service "opensemanticetl" does CPU usage return to normal.

Does anyone have news about this?

rusty9283 commented 3 years ago

After some testing I think my problem is maybe another: #341

movanet commented 3 years ago

Same issue here. It's been a few months since this post. Did you encounter any other file import issues after this? Also, would it help to turn it off after the fact (after it got stuck), or do we need to start clean and index again?

> @mbanks850 I only use the interface that is exposed to the user. -> Open Semantic Search interface -> config in the top menu -> "Named Entity Recognition" and "Graph DB (Neo4j)" options.

nikhilbhalwankar commented 3 years ago

I had the same problem today. Around 480,000 documents got indexed but they were stuck at file import. I waited for around 6 to 7 hours but it still looked stuck. I restarted the server and the import process started automatically. I am not sure, but it looks like the issue is related to the Flower server worker. I am using the virtual machine appliance (21.01.17).

HenryJones23 commented 1 year ago

This issue is still unresolved. I have probably encountered the same problem (Open Semantic Search installation package from 22.10.08). It regularly hangs during the extraction of files (see issue #461 for details). Did you guys ever find any solution to this?

Pooja1905 commented 5 months ago

I am facing a similar issue. Can anyone guide me on this? I have checked the Solr error logs, syslogs etc. and there don't seem to be any errors as such. The CPU of my EC2 instance is quite busy, not idle. I have deleted the indexes and recreated them a couple of times, but there is no change in the total number in the "Running file imports ..." stats. I have 95-100 GB of data (mixed media: PDFs, images, videos, audio, PNGs, CSV etc.).

I have left it alone for 2 days now and it hasn't made a dent in the numbers; however, the CPU utilization is 80-95% on average.