neuroquery / pubget

Collecting papers from PubMed Central and extracting text, metadata and stereotactic coordinates.
https://neuroquery.github.io/pubget/
MIT License

Out of memory error in running pipeline with v.0.0.1 #5

Closed — complexbrains closed this issue 1 year ago

complexbrains commented 2 years ago

Hi again,

I wanted to report a problem I have encountered a couple of times when running the nqdc pipeline with version 0.0.1 under Windows 10, during a whole night-long run. Apparently it is a low-memory problem caused by the vectorization of the papers. I haven't had a chance to reproduce the error with the code in the main repo yet, but I wanted to let you know in case it needs any additional checking.

DEBUG 2022-06-15T00:23:40+0100 _vectorization vectorizing articles 200 to 400 / 2636
DEBUG 2022-06-15T02:10:20+0100 _vectorization vectorizing articles 400 to 600 / 2636
DEBUG 2022-06-15T02:44:58+0100 _vectorization vectorizing articles 600 to 800 / 2636
DEBUG 2022-06-15T02:45:49+0100 _vectorization vectorizing articles 800 to 1000 / 2636
DEBUG 2022-06-15T03:05:08+0100 _vectorization vectorizing articles 1000 to 1200 / 2636
DEBUG 2022-06-15T03:09:34+0100 _vectorization vectorizing articles 1200 to 1400 / 2636
DEBUG 2022-06-15T03:21:18+0100 _vectorization vectorizing articles 1400 to 1600 / 2636
DEBUG 2022-06-15T03:25:12+0100 _vectorization vectorizing articles 1600 to 1800 / 2636
DEBUG 2022-06-15T03:27:30+0100 _vectorization vectorizing articles 1800 to 2000 / 2636
Traceback (most recent call last):
  File "C:\users\bilgi\anaconda3_new\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\users\bilgi\anaconda3_new\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\bilgi\Anaconda3_new\Scripts\nqdc_full_pipeline.exe\__main__.py", line 7, in <module>
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\nqdc\_commands.py", line 258, in full_pipeline_command
    extracted_data_dir, **_voc_kwarg(args.vocabulary_file)
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\nqdc\_vectorization.py", line 107, in vectorize_corpus_to_npz
    extracted_data_dir, output_dir, vocabulary_file
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\nqdc\_vectorization.py", line 125, in _do_vectorize_corpus_to_npz
    extraction_result = vectorize_corpus(extracted_data_dir, vocabulary_file)
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\nqdc\_vectorization.py", line 265, in vectorize_corpus
    corpus_file, vocabulary_file
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\nqdc\_vectorization.py", line 177, in _extract_word_counts
    pd.read_csv(corpus_file, encoding="utf-8", chunksize=chunksize)
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\pandas\io\parsers\readers.py", line 1024, in __next__
    return self.get_chunk()
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\pandas\io\parsers\readers.py", line 1074, in get_chunk
    return self.read(nrows=size)
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\pandas\io\parsers\readers.py", line 1047, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 224, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas\_libs\parsers.pyx", line 813, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas\_libs\parsers.pyx", line 857, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 1917, in pandas._libs.parsers.raise_parser_error
PermissionError: [Errno 13] Permission denied

Do you suggest any workaround of some sort, or is this really a sign that it is time to move on to the main repo? I will do that today!

jeromedockes commented 2 years ago

Hi, the CSV containing the articles is read 200 rows at a time; it seems that articles 1800 to 2000 filled the memory. We could reduce the batch size to read and process fewer articles at a time; there is a tradeoff between speed and memory usage. How much memory was available to the process that produced this log? And would it be possible to share the data directory with the data that had been downloaded and processed so far?
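For context, here is a minimal sketch of this kind of chunked reading and vectorization. The 200-row chunk size matches the batches in the log above, but the function name, the "text" column, and the use of scikit-learn are illustrative assumptions, not nqdc's actual implementation.

```python
# Illustrative sketch only, not nqdc's actual code.
# Reading the corpus CSV in chunks keeps only `chunksize` rows in memory at a
# time; lowering `chunksize` reduces peak memory at the cost of speed.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def count_words_in_chunks(corpus_file, vocabulary, chunksize=200):
    # With a fixed vocabulary, CountVectorizer can transform text without fitting.
    vectorizer = CountVectorizer(vocabulary=vocabulary)
    counts = []
    for chunk in pd.read_csv(corpus_file, encoding="utf-8", chunksize=chunksize):
        # The "text" column name is an assumption about the corpus CSV layout.
        counts.append(vectorizer.transform(chunk["text"].fillna("")))
    return counts
```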

jeromedockes commented 2 years ago

Is this really a sign that it is time to move on to the main repo?

I don't think so, but the release is indeed really outdated, so I wouldn't recommend using it. If installing from the repo is not convenient, just let me know; I can easily push a v0.0.2 release to PyPI.
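For reference, installing from the repository would look something like the command below; the exact repository URL is an assumption based on the project page linked above, so check the README for the current instructions.

```
pip install git+https://github.com/neuroquery/pubget.git
```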

jeromedockes commented 1 year ago

After an IRL discussion this seems to be resolved, so I am closing the issue.