Closed pnrobinson closed 4 years ago
@pnrobinson - I am happy to help take a look at this first thing tomorrow.
@callahantiff Thanks, awesome!
@pnrobinson - morning! I am going to start working on this now. Any chance you have a small sample .csv file I can use to make sure we have the functionality correct for your intended use? It could be good to add it to the test data directory so we can reuse it in a test too.
Found the data, should have read your prompt after I drank coffee 😺
@pnrobinson - Should be good to go. Want to give it a try now? I updated your test class so you should be able to examine the code using it.
Changes Made: Rather than using a separate method (build_dataset_from_csv), I modified process_input_text() and added 3 new optional attributes to the TextEncoder() class:
- payload_index - an int specifying the col index to process (default=None)
- header - an int specifying the row containing file header info (default=None)
- delimiter - a string specifying the file delimiter type (default=None)
To use the original functionality for reading in text data, do not pass any arguments to payload_index, header, or delimiter.
To utilize this method you would do the following:
encoder = TextEncoder(filename='tests/data/pubmed20n1015excerpt.txt',
                      payload_index=2,
                      header=None,
                      delimiter='\t',
                      data_type='sentences')
data, wrd_count, wrd_dictionary, rev_wrd_dictionary = encoder.build_dataset()
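For anyone who wants to see what the three options amount to, here is a rough standalone sketch of the column selection. It is not the actual code inside process_input_text(); the helper name read_payload_column is made up for illustration, and disabling quote handling is my assumption about what suits tab-delimited PubMed text.

```python
import csv
from typing import Iterable, List, Optional


def read_payload_column(lines: Iterable[str],
                        payload_index: Optional[int] = None,
                        header: Optional[int] = None,
                        delimiter: Optional[str] = None) -> List[str]:
    """Hypothetical helper: return the text to encode from each input line."""
    # No column options given: treat every non-empty line as plain text.
    if payload_index is None or delimiter is None:
        return [line.strip() for line in lines if line.strip()]

    payload = []
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary characters
    # (an assumption; the real class may parse the file differently).
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row_number, row in enumerate(reader):
        # Skip the header row (if any) and rows too short to hold the column.
        if row_number == header or len(row) <= payload_index:
            continue
        payload.append(row[payload_index].strip())
    return payload
```

Passing an open file handle works too, since the helper only needs an iterable of lines.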
tests/test_text_encoder.py - I extended the coverage of each of the three test classes to include the new input parameters when initializing the class. All tests in this script currently pass. In the future, we should probably consider DRYing out these tests, since there is a lot of duplicated functionality.
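One possible way to cut the duplication, sketched with unittest's subTest so a single test method covers every input mode. This assumes the test classes are unittest-based; split_payload is a hypothetical stand-in for the code under test, not something in the repo.

```python
import unittest


def split_payload(line, payload_index=None, delimiter=None):
    """Hypothetical stand-in for the code under test."""
    if payload_index is None or delimiter is None:
        return line.strip()
    return line.strip().split(delimiter)[payload_index]


class TestPayloadOptions(unittest.TestCase):
    # Each tuple: (description, input line, keyword arguments, expected text).
    CASES = [
        ("plain text", "A sentence.\n", {}, "A sentence."),
        ("tab-delimited", "1\t2019\tA sentence.\n",
         {"payload_index": 2, "delimiter": "\t"}, "A sentence."),
        ("comma-delimited", "1,2019,A sentence.\n",
         {"payload_index": 2, "delimiter": ","}, "A sentence."),
    ]

    def test_all_input_modes(self):
        # subTest reports each failing case separately instead of stopping
        # at the first failure.
        for description, line, kwargs, expected in self.CASES:
            with self.subTest(mode=description):
                self.assertEqual(split_payload(line, **kwargs), expected)
```

Adding a new input mode then means adding one tuple to CASES rather than another copy of the test method.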
@callahantiff This looks great! I think we can merge.
@pnrobinson - awesome! If you want, I will fix the problem in issue #179 and we can include those changes too?
Sounds great! Thanks
I am trying to figure out how to modify the text encoder to read in a CSV file that includes one "sentence" in field 3 (index 2) for each row. We will use this for the PubMed analysis. I have shown the intended usage in tests/test_text_encoder.py. However, I am getting various errors and clearly do not understand everything in the existing text encoder class. Any help appreciated.
PS, this is in the branch csv_reader.