juliema / label_reconciliations

Code for reconciling multiple transcriptions for a label
MIT License

Memory error with large .csv files #61

Closed. PmasonFF closed 4 years ago

PmasonFF commented 4 years ago

I am attempting to reconcile a 27 MB .csv file using reconcile.py 0.4.7. The file has 35 columns, of which only two are being reconciled. I am running a Windows 10 machine with 12 GB of RAM, the PyCharm IDE, and 32-bit Python 3.7, with no other jobs running. It appears the reconciled file with explanations is created, but the script fails with a MemoryError when writing the summary file:

    Traceback (most recent call last):
      File "C:/py/WWICards/reconcile.py", line 329, in <module>
        main()
      File "C:/py/WWICards/reconcile.py", line 322, in main
        reconcile_data(args, unreconciled, column_types)
      File "C:/py/WWICards/reconcile.py", line 298, in reconcile_data
        args, unreconciled, reconciled, explanations, column_types)
      File "C:\py\WWICards\lib\summary.py", line 70, in report
        out_file.write(summary)
    MemoryError

If I split the file into sections of approximately one third each, reconcile.py runs without issues. So I have a workaround, but it is a pain. Any suggestions?
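For anyone else needing the splitting workaround in the meantime, here is a minimal sketch of one way to do it. It is not part of reconcile.py. It assumes the rows that belong together share a value in a single key column (the name `subject_id` below is a hypothetical placeholder), and it keeps all rows with the same key in the same part so that no group is cut in half.

```python
import csv
import io


def split_csv_by_key(in_file, out_files, key_column):
    """Distribute rows over out_files, keeping rows that share a key together.

    key_column is an assumption about the input: the column whose value
    identifies which rows must be reconciled together.
    """
    reader = csv.DictReader(in_file)
    writers = []
    for out in out_files:
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        writer.writeheader()  # every part gets its own header row
        writers.append(writer)

    part_of = {}  # key value -> index of the part it was assigned to
    for row in reader:
        key = row[key_column]
        if key not in part_of:
            # Assign each newly seen key to the parts in round-robin order.
            part_of[key] = len(part_of) % len(writers)
        writers[part_of[key]].writerow(row)
    return part_of


# Usage with in-memory files; real use would pass files opened with newline="".
source = io.StringIO("subject_id,text\n1,a\n2,b\n1,c\n3,d\n2,e\n")
parts = [io.StringIO(), io.StringIO()]
assignment = split_csv_by_key(source, parts, "subject_id")
```

Only the key-to-part mapping is held in memory, so the input is read and written one row at a time.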

rafelafrance commented 4 years ago

I am checking for a reasonable way to split the Jinja2 templates (the HTML templating engine) into chunks.

PmasonFF commented 4 years ago

This may also be a 32-bit vs. 64-bit Python issue: I am running 32-bit Python 3.7, while a colleague running 64-bit Python 3.8 on a similar Windows machine has no issue with the same input files. I am not looking forward to upgrading.

I use reconcile.py for a number of non-NfN Zooniverse transcription projects. These projects tend to have many columns which do not need reconciliation (metadata and other non-transcription workflow tasks). I expect this adds to the load for the summary, which shows everything. Showing everything is desirable, or at least the ability to show certain columns which are not being reconciled.

I bless you guys and gals every time I run reconcile.py. Thank you! Peter

rafelafrance commented 4 years ago

I'm sort of shooting blind here but I hope I have a patch for you. Please let me know if this works.

PmasonFF commented 4 years ago

I replaced summary.py and reconciliations.js with these revised versions and assembled the full file to reconcile (36 MB+, 27 columns, 22 of them to be reconciled, 195,000 lines). I still got the MemoryError. The reconciled file was complete and correct, but the summary file had zero size.

I also tried it with just summary.py updated (i.e., with the original reconciliations.js). Under those conditions the script ran to completion, and the summary file contained the upper section and the reconciliation summary but, of course, not the detail section below. As far as I can tell, the summary section that was produced looked correct in all details.

I assume it builds the entire summary HTML file in memory before writing. Is Python hitting sys.maxsize, or the largest amount of memory the 32-bit architecture can address?
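On the sys.maxsize question: that value is the largest size a Python container can have (2**31 - 1 on a 32-bit build, 2**63 - 1 on a 64-bit build). It is not a memory budget. A 32-bit process is more likely to run out of address space first, since 32-bit Windows processes get 2 GB of user address space by default regardless of installed RAM. A quick sketch for checking which build is running (the function names are just for illustration):

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter, in bits."""
    return struct.calcsize("P") * 8


def is_64bit():
    """True when sys.maxsize exceeds what a 32-bit build can represent."""
    return sys.maxsize > 2**32


print(f"{interpreter_bits()}-bit Python, sys.maxsize = {sys.maxsize}")
```

Both checks describe the interpreter, not the operating system, so a 32-bit Python on 64-bit Windows still reports 32.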

rafelafrance commented 4 years ago

The change was supposed to bypass building in memory but I can see why it didn't.

Two more things that I can try (without a complete rewrite). Please pull and try again.
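To illustrate the general idea of not building the whole report in memory (this is a stdlib-only sketch, not the patch in the repository): the traceback shows a single `out_file.write(summary)` call on one large string, whereas a generator can hand the file one small fragment at a time. Jinja2 offers a comparable route through `Template.generate()` and `Template.stream()`. The function names and the table layout below are hypothetical.

```python
import html
import io


def render_rows(rows):
    """Yield an HTML table one small fragment at a time."""
    yield "<table>\n"
    for row in rows:
        cells = "".join(f"<td>{html.escape(str(cell))}</td>" for cell in row)
        yield f"<tr>{cells}</tr>\n"
    yield "</table>\n"


def write_report(out_file, rows):
    """Write each fragment as it is produced, so no full-page string exists."""
    for fragment in render_rows(rows):
        out_file.write(fragment)


# Usage with an in-memory buffer; a real report would use open(path, "w").
buffer = io.StringIO()
write_report(buffer, [("subject", "text"), (1, "a < b")])
```

Streaming the output only helps if the rows themselves are also produced lazily; if the input data or another large intermediate is already held in memory, a 32-bit process can still run out of address space.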

PmasonFF commented 4 years ago

Pulled the entire repository this time (not just the changed files). Unfortunately I got the same MemoryError. (I did verify that the copy of reconcile.py I ran included the modifications, in case I had not installed it correctly.) I am wondering if I should try setting up a virtual environment with 64-bit Python?

PmasonFF commented 4 years ago

That works. With a virtual environment using the Windows 64-bit version of Python 3.8.5 and the most recent versions of the other required packages, the script executed flawlessly. The entire 185 MB+ summary file seems to be there and working!

Where to from here?

rafelafrance commented 4 years ago

If you have a working version then let's let it rest for now. I'll make a note in the readme.md about trouble with 32-bit versions of Python and large files.