
Core OMERO Scripts
https://pypi.org/project/omero-scripts/

Solution for delimiter reading issue (#210) utilizing dynamic checkpoints #211

Closed JensWendt closed 10 months ago

JensWendt commented 11 months ago

Hello,

as detailed in #210, I implemented 3 dynamic checkpoints (at 25%/50%/75% of the file) instead of the fixed 500/1000/2000 characters.

I tried it with a fairly big .csv of 1.5 million characters and it worked as fast as the original solution.

I also experimented a bit with a smaller .csv with a lot of deletions and random other delimiters mixed in. It performed reasonably well for such fringe cases, failing to resolve the correct delimiter in only 2 of 10 or so cases.

Please take a look and tell me what you think. Thank you for your time!

will-moore commented 11 months ago

if parsing fails at 50% of the file, maybe it makes sense to simply read ALL of the file for the final try? Hopefully this won't be needed very often, but if it is then I expect most users would be willing to wait a tiny bit longer to give them the best chance of correctly reading the file?

will-moore commented 11 months ago

The build is failing with flake8 errors:

./omero/annotation_scripts/KeyVal_from_csv.py:167:21: E128 continuation line under-indented for visual indent
./omero/annotation_scripts/KeyVal_from_csv.py:172:80: E501 line too long (81 > 79 characters)
./omero/annotation_scripts/KeyVal_from_csv.py:174:25: E128 continuation line under-indented for visual indent
./omero/annotation_scripts/KeyVal_from_csv.py:179:80: E501 line too long (88 > 79 characters)
./omero/annotation_scripts/KeyVal_from_csv.py:181:25: E128 continuation line under-indented for visual indent
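For reference, E128 and E501 are pycodestyle checks (continuation-line alignment and the 79-character limit). A generic illustration with a toy helper (hypothetical, not the actual KeyVal_from_csv.py code):

```python
def join_fields(key, value, sep="="):
    # Toy helper used only to illustrate the flake8 messages above.
    return sep.join([key, value])

# flake8 E128 flags a continuation line that is not aligned with the
# opening parenthesis; aligning the second argument fixes it:
pair = join_fields("channel",
                   "DAPI")
```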
will-moore commented 11 months ago

Testing on a small csv file worked fine for me. I'm not going to do extensive testing on lots of csvs, but I can confirm with the test csv (on issue above), the original code doesn't parse the file correctly but this code does.

[Screenshot 2023-09-07 at 12:16:29]

JensWendt commented 11 months ago

> if parsing fails at 50% of the file, maybe it makes sense to simply read ALL of the file for the final try? Hopefully this won't be needed very often, but if it is then I expect most users would be willing to wait a tiny bit longer to give them the best chance of correctly reading the file?

Yes and no. I think that might be a general discussion: why not immediately read 100% and take the safest route? Admittedly, the checkpoints are arbitrary and not chosen with sound reasoning. But we are talking about timespans in the single-digit milliseconds (according to %%timeit in the Jupyter notebook sandbox that I use for trying this out). So it might not be too important where to put the checkpoints?

But, if you have strong feelings about this then I will kill the 75% checkpoint.
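The checkpoint approach described above can be sketched roughly like this (a hypothetical illustration using Python's `csv.Sniffer`; the function name and exact details are assumptions, not the actual KeyVal_from_csv.py code):

```python
import csv

def guess_delimiter(path, checkpoints=(0.25, 0.5, 0.75)):
    """Try csv.Sniffer on growing fractions of the file instead of
    fixed 500/1000/2000-character samples (hypothetical sketch)."""
    with open(path, newline="") as f:
        text = f.read()
    for fraction in checkpoints:
        sample = text[:int(len(text) * fraction)]
        try:
            return csv.Sniffer().sniff(sample).delimiter
        except csv.Error:
            continue  # sample too small or ambiguous; try a larger slice
    raise csv.Error("could not determine the delimiter")
```

Each failed attempt simply widens the sample, so on a well-formed file the cost is a single successful sniff at the first checkpoint.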

JensWendt commented 11 months ago

> Testing on a small csv file worked fine for me. I'm not going to do extensive testing on lots of csvs, but I can confirm with the test csv (on issue above), the original code doesn't parse the file correctly but this code does.

Yes, I did quite some testing, primarily with fringe .csvs where lots of cells were missing or other delimiters were intermingled. But in the end we will have to rely on the community as testers, and if something weird comes up I will try to fix it quickly.

JensWendt commented 10 months ago

bump

will-moore commented 10 months ago

Re 75% - I just thought that if the last 25% of the file had some useful content that could help with picking a delimiter, it would make sense not to ignore it. At that point (after failing on a smaller portion of the file) you care more about actually getting the right value than about speed, since this path will only be taken on a small portion of occasions. But I don't have any evidence or stats on how much more effective it would be, so happy to go with what you've got.

will-moore commented 10 months ago

Hi @JensWendt - apologies, I should have asked earlier, but could you please fill out a CLA form as described at https://ome-contributing.readthedocs.io/en/latest/cla.html. This is a requirement to cover all of OME's GPL-licensed projects. Many thanks!

JensWendt commented 10 months ago

> Re 75% - I just thought that if the last 25% of the file had some useful content that could help with picking a delimiter, it would make sense not to ignore it. At that point (after failing on a smaller portion of the file) you care more about actually getting the right value than about speed, since this path will only be taken on a small portion of occasions. But I don't have any evidence or stats on how much more effective it would be, so happy to go with what you've got.

Would you be okay with a fourth nested try for reading the whole file? It seems I will be touching the script again soon anyway.
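That fourth, whole-file attempt could look something like this (again a hedged sketch under the same assumptions, not the script's actual code):

```python
import csv

def sniff_with_fallback(text, checkpoints=(0.25, 0.5, 0.75)):
    """Sniff a dialect at 25%/50%/75% of the text, then fall back to
    the whole file as a final attempt (hypothetical sketch)."""
    for fraction in checkpoints:
        try:
            return csv.Sniffer().sniff(text[:int(len(text) * fraction)])
        except csv.Error:
            pass  # try a larger sample
    # Final try: the whole file, trading a little speed for the best
    # chance of picking the right delimiter.
    return csv.Sniffer().sniff(text)
```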

JensWendt commented 10 months ago

> Hi @JensWendt - apologies, I should have asked earlier, but could you please fill out a CLA form as described at ome-contributing.readthedocs.io/en/latest/cla.html. This is a requirement to cover all of OME's GPL-licensed projects. Many thanks!

here you go: Binder1.pdf

will-moore commented 10 months ago

Great, thanks for that.

This PR will be included in an upcoming OMERO.server release. Unfortunately the scripts don't get released unless we release the server, which is a limitation. But feel free to continue to improve the script.