ome / omero-scripts

Core OMERO Scripts
https://pypi.org/project/omero-scripts/
12 stars 32 forks source link

Improved delimiter sniffing for .csv files #210

Open JensWendt opened 1 year ago

JensWendt commented 1 year ago

Hello,

Adding to a previous discussion we had #195, I would like to propose an improvement to the delimiter sniffing technique we use. Why do I come up with this now? Because @SchmChris found a .csv which could not be properly resolved with the current code. the .csv in question: test.csv

the current code reads:

with open(temp_name, 'rt', encoding='utf-8-sig') as file_handle:
            try:
                delimiter = csv.Sniffer().sniff(
                    file_handle.read(500), ",;\t").delimiter
                print("Using delimiter: ", delimiter,
                      " after reading 500 characters")
            except Exception:
                file_handle.seek(0)
                try:
                    delimiter = csv.Sniffer().sniff(
                        file_handle.read(1000), ",;\t").delimiter
                    print("Using delimiter: ", delimiter,
                          " after reading 1000 characters")
                except Exception:
                    file_handle.seek(0)
                    try:
                        delimiter = csv.Sniffer().sniff(
                            file_handle.read(2000), ";,\t").delimiter
                        print("Using delimiter: ", delimiter,
                              " after reading 2000 characters")
                    except Exception:
                        print("Failed to sniff delimiter, using ','")
                        delimiter = ","
            # reset to start and read whole file...
            file_handle.seek(0)
            data = list(csv.reader(file_handle, delimiter=delimiter))

I propose ditching the fixed values of characters we feed into the sniffer and go for a dynamic approach where we read the first quarter, half , three quarters or the whole file.

from math import floor
with open("temp_name", 'rt', encoding='utf-8-sig') as file_handle:
    file_length = len(file_handle.read(-1))
    try:
        delimiter = csv.Sniffer().sniff(
            file_handle.read(floor(file_length/4)), ",;\t").delimiter
        print("Using delimiter: ", delimiter,
              f" after reading {floor(file_length/4)} characters")
    except Exception:
        file_handle.seek(0)
        try:
            delimiter = csv.Sniffer().sniff(
                file_handle.read(floor(file_length/2)), ",;\t").delimiter
            print("Using delimiter: ", delimiter,
                  f" after reading {floor(file_length/2)} characters")
        except Exception:
            file_handle.seek(0)
            try:
                delimiter = csv.Sniffer().sniff(
                    file_handle.read(floor(file_length*0.75)), ",;\t").delimiter
                print("Using delimiter: ", delimiter,
                      f" after reading {floor(file_length*0.75)} characters")
            except Exception:
                file_handle.seek(0)
                try:
                    delimiter = csv.Sniffer().sniff(
                        file_handle.read(file_length), ",;\t").delimiter
                    print("Using delimiter: ", delimiter,
                          " after reading all characters")
                except Exception:
                    print("Failed to sniff delimiter, using ','")
                    delimiter = ","
# reset to start and read whole file...
    file_handle.seek(0)
    data = list(csv.reader(file_handle, delimiter=delimiter))

This will cost maybe a bit more time, but will more reliably lead to a successful "sniffing" of the delimiter.

Does this sound like a good idea, am I missing some issue or is there maybe a more elegant approach to implement this?

When this more robust method is implemented (in whatever fashion) I would also like to adapt the code for populate_metadata.py and OMERO.parade to include this, as the delimiter-issue will come up for any german localized Excel version and possibly prevents new OMERO users from utilizing this part of the OMERO functionality.

will-moore commented 1 year ago

This reminded me of a similar discussion about auto-detection of encoding for csv files at https://github.com/ome/omero-scripts/pull/198

Although now I read above a bit more carefully I see that the suggested changes are quite limited. Do you have an idea why the changes lead to a better outcome for the example csv? Does the existing code fail to read far enough, or does it choose an incorrect delimiter too early?

JensWendt commented 1 year ago

Yes, I followed that one closely as well, but fortunately this problem here seems less complicated.

The test.csv is a bit over 4100 characters long and has ~20 rows. The header provides enzyme inhibitor as the last column, but actual data for that column only shows up at row 9. Following the explanation found here (yeah, I know Quora is not the best source of information, but I found nothing else and the explanation makes sense ¯\_(ツ)_/¯ ) it seems the "variance" of the delimiter occurence is too high in the first 500 characters (roughly 5 rows) and 1000 characters (roughly 10 rows). When I tell him "manually" to sniff the whole lenght he actually gives me back the correct ";", so it seems that the 2000 characters of the last nested try are just on the edge to not be enough to precisely determine the delimiter.

I assume that we will encounter .csvs in the future with gaps in the data causing "variance" and therefore now my proposed solution to dynamically go quarter, half, three quarters and ultimately full length of characters. Hoping that if everything else fails just throw the whole thing into the sniffer and hope for the best.

I will try out a bit longer test.csv with more missing data values later on, just to see how it behaves and where the limits might be.

JensWendt commented 1 year ago

Okay, I played around quite a bit with differing amounts of missing values (represented by "\<value>;;\<value>;\<value>;;" --> delimiters with nothing in between) and it seems like 50% is the sweet spot. I "massacred" the data within reasonable limits and it mostly finds the delimiter after reading 50% of the characters, sometimes e.g. when I leave a column (or more) almost empty, it has to read 75%.

JensWendt commented 12 months ago

Should I provide an official PR for this solution?

will-moore commented 12 months ago

Hi @JensWendt yes please - I think that would be helpful for discussion and testing, thanks.