e-maud closed this 5 years ago
Notes:
textacy's Corpus is apparently slow and heavy: https://github.com/chartbeat-labs/textacy/issues/183 (I was having similar problems trying to load thousands of Doc objects)
Different approaches were explored to read our bz2 archives, with dask.read_text and with smart_open, and neither is straightforward. The final choice is boto3, which does not make us dependent on another library.
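For illustration only, here is a minimal sketch of how a bz2-compressed JSON Lines archive could be read from S3 with boto3. It is not the code from this pull request: the helper names `iter_bz2_jsonlines` and `fetch_bz2_jsonlines`, and the assumptions that each archive holds one JSON object per line and fits in memory, are hypothetical.

```python
import bz2
import io
import json


def iter_bz2_jsonlines(stream):
    """Yield one parsed JSON object per non-empty line of a bz2 stream.

    `stream` is any binary file-like object holding bz2-compressed,
    UTF-8 encoded JSON Lines data (hypothetical helper, for illustration).
    """
    with bz2.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def fetch_bz2_jsonlines(bucket, key):
    """Download one S3 object with boto3 and parse it as bz2 JSON Lines.

    Assumes the archive fits in memory and that AWS credentials are
    already configured; bucket and key are supplied by the caller.
    """
    import boto3  # imported lazily so the parser above works without it

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return list(iter_bz2_jsonlines(io.BytesIO(body.read())))
```

The parsing step is kept separate from the download so that it can be tested against in-memory bytes, without network access.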
[x] implement functions and tests
[x] implement and test with NER and SOLR
Functions: read_jsonlines and readtext_jsonlines (to come in next pull request)