h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0

Handle white space better #514

Open pseudotensor opened 6 years ago

pseudotensor commented 6 years ago

OrdonezA_ADLs.txt

pseudotensor commented 6 years ago

This also leads to failure in autodl at end, seemingly because tabs or something else about the column name cannot be handled as a key. @arnocandel seed = 73704

Applying transformation to test set
concurrent.futures.process._RemoteTraceback: """
Traceback (most recent call last):
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/jon/h2oai/h2oaicore/auto_dl_support.py", line 1397, in score_test_set_subprocess
    test_munged_df = pipeline.transform(test_df)
  File "/home/jon/h2oai/h2oaicore/transformers.py", line 368, in transform
    new_X = self.pipe.transform(X, **fit_params)
  File "/home/jon/h2oai/h2oaicore/transformers.py", line 689, in transform
    output = pipeline.transform(X)
  File "/home/jon/h2oai/h2oaicore/transformers.py", line 1590, in transform
    x_new = X[unroll_list(self.column_names)].reset_index(drop=True).copy()
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/core/frame.py", line 2053, in __getitem__
    return self._getitem_array(key)
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/core/frame.py", line 2097, in _getitem_array
    indexer = self.ix._convert_to_indexer(key, axis=1)
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/core/indexing.py", line 1230, in _convert_to_indexer
    raise KeyError('%s not in index' % objarr[mask])
KeyError: "['10:18:11\t\tSleeping'] not in index"
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/jon/h2oai/h2oaicore/auto_dl_support.py", line 1330, in _check_ret_subprocess
    ids_cols=None, classification=classification)
  File "/home/jon/h2oai/h2oaicore/auto_dl_support.py", line 1379, in score_test_set
    target, labels, ids_cols, classification).result()
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/concurrent/futures/_base.py", line 405, in result
    return self.__get_result()
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
KeyError: "['10:18:11\t\tSleeping'] not in index"
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "tests/test_integration/test_random.py", line 598, in <module>
    test_random()
  File "tests/test_integration/test_random.py", line 591, in test_random
    check_ret(ret, target, train_df, test_df, max_rows, classification, seed, cv_folds)
  File "/home/jon/h2oai/h2oaicore/auto_dl_support.py", line 1280, in check_ret
    future.result()
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/concurrent/futures/_base.py", line 405, in result
    return self.__get_result()
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
KeyError: "['10:18:11\t\tSleeping'] not in index"

pseudotensor commented 6 years ago

another.zip

This one gives the error below. It looks like dt just can't handle extra spaces/tabs in some places, and the error message is misleading: the real issue is how dt interprets white space.

>       df = read_data(logger, train, "training")

tests/test_integration/test_random.py:445: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
h2oaicore/systemutils.py:948: in read_data
    df = fread_wrapper(train_data, logger)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

train_data = '/tmp/65581/uci65581_seeds_dataset.txt'
logger = <RootLogger root (DEBUG)>

    def fread_wrapper(train_data, logger):
        try:
            df = dt.fread(train_data, logger=logger)
        except Exception as e:
            s = "\n%s: %s" % (e.__class__.__name__, str(e))
            s += "\n\nFirst few lines of the file (anonymized):\n\n"
            l = ""
            with open(train_data) as f:
                for i, line in enumerate(f):
                    if i > 10:
                        break
                    l += line
>           raise InvalidDataError("Couldn't parse file '%s'\n%s" % (train_data, s + anonymize(l)))
E           h2oaicore.systemutils.InvalidDataError: Couldn't parse file '/tmp/65581/uci65581_seeds_dataset.txt'
E           
E           RuntimeError: Line 8 from sampling jump 0 starting "14.11   14.1" has more than the expected 8 fields. Separator 8 occurs at position 36 which is character 2 of the last field: "5     1". Consider setting 'comment.char=' if there is a trailing comment to be ignored.
E           
E           First few lines of the file (anonymized):
E           
E           99.99   99.99   9.999   9.999   9.999   9.999   9.99    9
E           99.99   99.99   9.9999  9.999   9.999   9.999   9.999   9
E           99.99   99.99   9.999   9.999   9.999   9.999   9.999   9
E           99.99   99.99   9.9999  9.999   9.999   9.999   9.999   9
E           99.99   99.99   9.9999  9.999   9.999   9.999   9.999   9
E           99.99   99.99   9.9999  9.999   9.999   9.999   9.999   9
E           99.99   99.99   9.9999  9.999   9.999   9.999   9.999   9
E           99.99   99.9    9.9999  9.99    9.999   9.9     9       9
E           99.99   99.99   9.9999  9.999   9.999   9.99    9.999   9
E           99.99   99.99   9.999   9.999   9.999   9.999   9.999   9
E           99.99   99.99   9.9999  9.999   9.999   9.999   9.999   9

h2oaicore/systemutils.py:939: InvalidDataError
pradkrish commented 3 years ago

@st-pasha

To fix this bug, should repeated tabs be treated as a single tab? For example, the following row

14.11   14.1    0.8911  5.42    3.302   2.7     5       1

is internally treated as

14.11\t14.1\t0.8911\t5.42\t3.302\t2.7\t\t5\t\t1\n

and so ends up being split into extra (empty) columns.
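That internal treatment is easy to reproduce with plain Python string splitting (illustrative only; this is not datatable's parser):

```python
# Aligned row from the seeds dataset: short values like "2.7" are
# followed by TWO tabs to keep the next column visually aligned.
row = "14.11\t14.1\t0.8911\t5.42\t3.302\t2.7\t\t5\t\t1"

# Treating every tab as a separator (the TSV convention) turns each
# doubled tab into an empty field -- i.e. an extra, spurious column.
fields = row.split("\t")
print(len(fields))        # 10 fields instead of the expected 8
print(fields.count(""))   # 2 empty fields from the doubled tabs
```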

st-pasha commented 3 years ago

Well, the problem is that TSV (tab-separated values) is a popular format where the values are separated with tabs. So if you have a row like [1, NA, 2] then it ends up encoded as 1\t\t2\n, which is why tabs must be treated as separate delimiters.

The correct solution here is first to try parsing the file as a table of values, where each column is expected to be vertically aligned with an arbitrary number of tabs/spaces. We can try different choices of tab widths too: 8, 4, 2. Note that even within a table there could be gaps where the values are missing, in which case it will look like a sequence of whitespace spanning multiple columns.

pradkrish commented 3 years ago

Is that so? The example you chose, [1, NA, 2], is encoded as 1\tNA\t2\n; it is [1, , 2] that is encoded as 1\t\t2\n.

How about this: let's say, in the first run, we remove all the extra tabs (which are placeholders for missing values) and check the number of inferred columns for each row. If they all match, then that is the likeliest configuration. If they don't match, we just reintroduce the removed tabs and carry on as usual. Let me illustrate this with a 3×3 example:

[1,2,NA]
[1, ,2]
[1, ,2]  --------> 1\t2\tNA\n1\t\t2\n1\t\t2\n will become 1\t2\tNA\n1\t2\n1\t2\n

After removing extra tabs, row1 will have 3 columns, row2 will have 2 columns, and row3 will have 2 columns. Since they don't match, we reintroduce the removed tabs and run again; the result will be a 3×3 table. As a second example:

[1,2, ]
[1, ,2]
[1, ,2]  --------> 1\t2\n1\t\t2\n1\t\t2\n will become 1\t2\n1\t2\n1\t2\n

After removing extra tabs, row1, row2 and row3 will each have 2 columns. Since they do match, that is the likeliest configuration and the result will be a 3×2 table. This is in essence what's happening in another.zip. Here, I am assuming that we would like to treat the second example as a 3×2 table and not as a 3×3 table, which is why this issue was raised in the first place.

One more ingredient is necessary to make this work. Remove the tab whenever it is immediately followed by a newline. So [1,2, ] becomes 1\t2\n1 and not 1\t2\t\n1.

According to this logic, we run through every frame twice. Let me know what you think. I sincerely hope that was clear. :)
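If I understand the proposal, the two-pass logic could be sketched in plain Python like this (illustrative only, not fread's code; `infer_rows` is a made-up name):

```python
import re

def infer_rows(lines):
    # Pass 1: collapse runs of tabs into a single separator, and drop
    # any tab that immediately precedes the newline.
    collapsed = [re.split(r"\t+", ln.rstrip("\t\n")) for ln in lines]
    # If every row then has the same field count, that is the
    # likeliest configuration.
    if len({len(r) for r in collapsed}) == 1:
        return collapsed
    # Pass 2: counts disagree, so reintroduce the removed tabs and
    # fall back to treating each tab as its own separator.
    return [ln.rstrip("\n").split("\t") for ln in lines]

# First example: counts disagree (3, 2, 2), so fall back -> 3x3 table.
print(infer_rows(["1\t2\tNA\n", "1\t\t2\n", "1\t\t2\n"]))
# Second example: counts agree (2, 2, 2) -> 3x2 table.
print(infer_rows(["1\t2\t\n", "1\t\t2\n", "1\t\t2\n"]))
```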

st-pasha commented 3 years ago

So what you are suggesting is that we have a new separator: \t+, which means "one or more tabs", similar to how we already have separator \s+. And if during the detection phase it manages to produce a more consistent number of columns than \t separator, then we should use \t+ for parsing.

I agree that this would parse another.zip, and the original Ordonez file (assuming we trim the trailing tabs).
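For the seeds row, switching the separator from \t to \t+ does recover the expected field count (a plain-Python check, not fread's detection code):

```python
import re

# Aligned row from the seeds dataset (doubled tabs before short fields).
row = "14.11\t14.1\t0.8911\t5.42\t3.302\t2.7\t\t5\t\t1"

print(len(row.split("\t")))        # 10 -- plain \t splitting over-counts
print(len(re.split(r"\t+", row)))  # 8  -- \t+ yields the expected 8 fields
```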

However, I'm worried that this would not work if the data contained any gaps (NAs). Whenever there is such a gap, two separators end up next to each other and would be processed as a single separator, causing all subsequent fields to be shifted. For example, consider the following data:

Start time    End time     Activity
02:27:59      10:18:11     Sleeping
10:21:24                   Toileting
10:25:44      10:33:00     Showering

Even without counting the spaces or tabs here, you know just by looking that there are 3 columns here, and that the 2nd row should be read as ("10:21:24", None, "Toileting"). And the reason you know this is that the columns are visually aligned to the same positions. So, a reader ultimately has to implement the same logic: detect if the input consists of vertically aligned columns, and then parse accordingly (which won't even be a csv reader).
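That aligned-column detection could be sketched roughly like this (plain Python, assuming tabs have already been expanded at a single known width; `split_aligned` is a hypothetical helper, not part of datatable):

```python
def split_aligned(lines, tabsize=8):
    # Expand tabs so visual alignment becomes positional alignment.
    expanded = [ln.expandtabs(tabsize).rstrip("\n") for ln in lines]
    width = max(len(ln) for ln in expanded)
    # A position is a gap if it is blank (or past the end) in EVERY line.
    blank = [all(i >= len(ln) or ln[i] == " " for ln in expanded)
             for i in range(width)]
    # Runs of non-gap positions are the column spans.
    spans, start = [], None
    for i, b in enumerate(blank + [True]):
        if not b and start is None:
            start = i
        elif b and start is not None:
            spans.append((start, i))
            start = None
    # Slice each line at the spans; an all-blank cell becomes None (NA).
    return [[ln[s:e].strip() or None for s, e in spans] for ln in expanded]

rows = split_aligned([
    "Start time    End time     Activity",
    "02:27:59      10:18:11     Sleeping",
    "10:21:24                   Toileting",
    "10:25:44      10:33:00     Showering",
])
print(rows[2])  # ['10:21:24', None, 'Toileting']
```

A real implementation would still need to try several tab widths and decide when this alignment hypothesis beats the ordinary delimited-file one.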

pradkrish commented 3 years ago

If we were to remove all the extra tabs for missing values in your example, row0, row1 and row3 would be inferred to contain 3 columns while row2 would have 2 columns. Since the counts don't match, according to my logic we fall back to the current way of arranging. It's only when the number of inferred columns matches across all rows that we go with the new method. I don't see why it doesn't work. :thinking:

st-pasha commented 3 years ago

Yes, but the current way of arranging will also fail, because there are multiple tabs/spaces between the values there. So the dataset will fail to parse, even though it looks perfectly "normal" to the human eye.