greenelab / snorkeling

Extracting biomedical relationships from literature with Snorkel 🏊

Scaling Snorkeling To Handle PubMed #20

Closed: David-Durst closed this issue 7 years ago

David-Durst commented 7 years ago

I've downloaded the greenelab version of the Snorkeling project and gotten it running. See [1] at the bottom for a setup suggestion related to that.

I can't run David's code [2], so I can't test the issue myself. However, I've read the code and have a few questions that will help my investigation.

  1. What is the exact problem? Is it that CorpusParser.apply, with the default implementation of XMLMultiDocPreparser, runs out of memory while loading data from /home/davidnicholson/Documents/Data/pubmed_docs.xml?

  2. I heard something about a memory leak. Is a leak still suspected? If so, what's the evidence?

  3. Why was chunking done? It appears that `corpus_parser.apply(xml_parser)`, when using XMLMultiDocPreparser from snorkeling/All_Relationships/utils/bigdata_utils.py, should follow a scalable process: read one document from the XMLMultiDocPreparser, call CorpusParserUDF.apply on it once, dereference that document so it can be garbage collected, and repeat. The parallel version of corpus_parser.apply should have each process follow this loop independently. (See the sketch after this list.)
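To make the pattern in point 3 concrete, here's a minimal sketch of the one-document-at-a-time loop. It uses lxml's `iterparse` rather than the project's actual XMLMultiDocPreparser, and `process_document` and the `'document'` tag name are hypothetical placeholders, not names from the repo:

```python
# Minimal sketch of the streaming pattern from point 3, using lxml.etree.iterparse.
# `process_document` and the 'document' tag are placeholders; the real pipeline
# would call CorpusParserUDF.apply at that step instead.
from lxml import etree

def iter_documents(xml_path, doc_tag='document'):
    """Yield one parsed document element at a time from a large XML file."""
    for _, elem in etree.iterparse(xml_path, tag=doc_tag):
        yield elem
        # Drop the subtree (and any already-consumed preceding siblings)
        # so it can be garbage collected before the next document is parsed.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

def parse_corpus(xml_path, process_document):
    # Each document is processed and released before the next one is read,
    # so peak memory stays roughly constant regardless of corpus size.
    for doc in iter_documents(xml_path):
        process_document(doc)
```

If something downstream retains a reference to each parsed document (for example, objects accumulating in a database session before a commit), memory would still grow despite this loop, which could look like a leak.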

-- Notes --

[1] Not sure if it's just me, but in the future you might want to recommend that new developers also do the following:

- Add conda-forge to their conda channels: `conda config --add channels conda-forge`

- Install icu 56.1 from the conda-forge channel: `conda install icu==56.1`

After sourcing the conda env, I had trouble loading the lxml library for Snorkel due to a missing shared object from icu version 56; installing this dependency fixed it.

[2] David's code refers to a Postgres database that I don't have, and to a path on David's machine, /home/davidnicholson/Documents/Data, that is not in the git repo.

danich1 commented 7 years ago

Thanks for opening this issue. I plan to get everything set up (fingers crossed) by Friday, so that you can reproduce the issue yourself.

David-Durst commented 7 years ago

Sounds good. Let me know if the project is set up for me to help further, or if you've solved the problem.

danich1 commented 7 years ago

So it looks like the problem was resolved when I upgraded my RAM from 32GB to 64GB. In light of this "magical" fix, I'm not sure whether digging into the code will uncover any hidden issues; nonetheless, I'll get the project set up as soon as I can so you can still take a look if you'd like.