monarch-initiative / embiggen

🍇 Embiggen is the Python Graph Representation learning, Prediction and Evaluation submodule of the GRAPE library.
BSD 3-Clause "New" or "Revised" License

CSV parser #178

Closed pnrobinson closed 4 years ago

pnrobinson commented 4 years ago

I am trying to figure out how to modify the text encoder to read in a CSV file that includes one "sentence" in field 3 (index 2) for each row. We will use this for the PubMed analysis. I have shown the intended usage in tests/test_text_encoder.py in

class TestCsvTextEncoder(TestCase):

However, I am getting various errors and clearly do not understand everything in the existing text encoder class. Any help appreciated.

PS, this is in the branch csv_reader

callahantiff commented 4 years ago

@pnrobinson - I am happy to help take a look at this first thing tomorrow.

pnrobinson commented 4 years ago

@callahantiff Thanks, awesome!

callahantiff commented 4 years ago

@pnrobinson - morning! I am going to start working on this now. Any chance you have a small sample .csv file I can use to make sure we have the functionality correct for your intended use? It could be good to add it to the test data directory so we can reuse it in a test too.

callahantiff commented 4 years ago

Found the data, should have read your prompt after I drank coffee 😺

callahantiff commented 4 years ago

@pnrobinson - Should be good to go. Want to give it a try now? I updated your test class so you should be able to examine the code using it.


Changes Made: Rather than using a separate build_dataset_from_csv method, I modified process_input_text() and added 3 new optional attributes to the TextEncoder() class: payload_index, header, and delimiter.

To use the original functionality for reading in text data, do not pass any arguments to payload_index, header, or delimiter.


To utilize this method you would do the following:

encoder = TextEncoder(filename='tests/data/pubmed20n1015excerpt.txt',
                      payload_index=2,
                      header=None,
                      delimiter='\t',
                      data_type='sentences')

data, wrd_count, wrd_dictionary, rev_wrd_dictionary = encoder.build_dataset()
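
For anyone following along, here is a rough standalone sketch of what the three parameters are meant to control. It is not the actual TextEncoder code: read_payload_column is a hypothetical helper written only for illustration, and it assumes header is a row index (with None meaning "no header row") and that rows too short to contain the payload column are skipped.

```python
import csv
import io
from typing import Iterable, List, Optional


def read_payload_column(lines: Iterable[str],
                        payload_index: int = 2,
                        header: Optional[int] = None,
                        delimiter: str = '\t') -> List[str]:
    """Return the text found in column `payload_index` of each row.

    Hypothetical helper, not part of embiggen. `header` is assumed to be
    the index of the header row, or None when the file has no header.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    sentences = []
    for row_number, row in enumerate(reader):
        # Skip everything up to and including the declared header row.
        if header is not None and row_number <= header:
            continue
        # Ignore rows that are too short to contain the payload column.
        if len(row) > payload_index:
            sentences.append(row[payload_index])
    return sentences


# Made-up sample rows: the "sentence" sits in field 3 (index 2).
sample = io.StringIO("1\t2019\tFirst abstract sentence.\n"
                     "2\t2019\tSecond abstract sentence.\n")
sentences = read_payload_column(sample)
```

The resulting list of sentences is what would then be tokenized and encoded; leaving the three arguments at their defaults in the real class keeps the original plain-text behavior.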

tests/test_text_encoder.py - I extended the coverage of each of the three test classes to include the new input parameters when initializing the class. All tests in this script currently pass. In the future, we should probably consider DRYing up these tests, since there is a lot of duplicated functionality.

pnrobinson commented 4 years ago

@callahantiff This looks great! I think we can merge.

callahantiff commented 4 years ago

@pnrobinson - awesome! If you want, I will fix the problem in issue #179 and we can include those changes too?

pnrobinson commented 4 years ago

Sounds great! Thanks