machinalis / iepy

Information Extraction in Python
BSD 3-Clause "New" or "Revised" License

Big texts are breaking import process #48

Closed dchaplinsky closed 9 years ago

dchaplinsky commented 10 years ago

I tried to import my corpus of Ukrainian texts, and apparently one of them was too big for iepy:

Added 2503 documents
Traceback (most recent call last):
  File "bin/csv_to_iepy.py", line 29, in <module>
    csv_to_iepy(filepath)
  File "/Users/dchaplinsky/Projects/pullenti-ukr/iepy/venv/lib/python3.4/site-packages/iepy/utils.py", line 111, in csv_to_iepy
    for i, d in enumerate(reader):
  File "/usr/local/Cellar/python3/3.4.1_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 110, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)
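
For context, the 131072 in the error above is the default per-field size limit of Python's csv module, which can be changed with `csv.field_size_limit`. Below is a minimal sketch of raising that limit before reading. The helper names `raise_csv_field_limit` and `read_rows` and the sample data are hypothetical, and this is not necessarily how iepy itself handles the problem.

```python
import csv
import io
import sys


def raise_csv_field_limit():
    """Raise the csv per-field size limit as far as the platform allows.

    Hypothetical helper, not part of iepy. Returns the limit that was set.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # On some platforms the limit must fit in a C long,
            # so shrink it until the call is accepted.
            limit //= 10


def read_rows(stream):
    """Read all rows from a CSV stream after raising the field limit."""
    raise_csv_field_limit()
    return list(csv.reader(stream))


# A single field longer than the default limit of 131072 characters.
big_field = "x" * 200000
rows = read_rows(io.StringIO("doc1," + big_field + "\n"))
```

Note that `csv.field_size_limit` changes a module-wide setting, so it affects every reader created afterwards in the same process.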

While I understand that having such a text in a corpus is a bit stupid, I think a good solution here would be:

jmansilla commented 10 years ago

Hi Dmitry!

+1 to that simple approach.

If you have a patch done (or time to write one), send a pull request with it to the develop branch.

Otherwise, it will probably be done here soon.

Something else to add: we provide a CSV importer as an easy way to get started with IEPY, but it should be pretty easy to create importers for other formats.