htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

Change indexes returned by pd.read_parquet(tokenlist) (not public API). #41

Open bmschmidt opened 3 years ago

bmschmidt commented 3 years ago

Long message to explain why I want to change a single test.

Some background on what happens when writing to parquet with indexes. Currently, htrc_features writes parquet files with pandas indexes. I previously thought this was a performance optimization, but in the parquet file format, pandas-style indexes don't exist. What happens behind the scenes is that the index levels are turned into ordinary columns on disk, and a cheat sheet describing the pandas index structure is stored in the file metadata. When a parquet file is read back, that metadata is used to rebuild the indexes. Doing all this is fairly costly, so it's preferable to store plain columns in parquet and add indexes only as needed.

This can make non-pandas interactions much faster. It even opens the way for some optimizations on the pandas methods; e.g., currently if you ask for counts by page and word, it wastes some time building a part-of-speech index you'll never use.
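One way to sketch that optimization (column names mirror the feature files; the data is made up): with plain columns on disk, a request for counts by page and word can aggregate directly, never touching the part-of-speech column.

```python
import pandas as pd

# Illustrative token list as read back from an unindexed parquet file.
tokens = pd.DataFrame({
    "page": [1, 1, 2],
    "pos": ["DT", "NN", "DT"],
    "lowercase": ["the", "cat", "the"],
    "count": [2, 1, 3],
})

# Group only on the levels the caller asked for; no (page, pos, ...)
# MultiIndex is ever built just to be thrown away.
by_page_word = tokens.groupby(["page", "lowercase"])["count"].sum()
```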

It doesn't impact any tests that use the public API, and it creates no problems loading parquet files created under the old behavior or the new one (the file format is the same; only whether index metadata is stored differs). But there is one test that needs to be rewritten to follow this strategy, because it reads the parquet file directly rather than going through the API.

    def test_partial_tokenlist_saving(self, volume, tmp_path):
        tkwargs = dict(case=False, pos=False, drop_section=True)
        volume.save(tmp_path, format='parquet', files=['tokens'], token_kwargs=tkwargs)

        path = os.path.join(tmp_path, os.listdir(tmp_path)[0])
        df = pd.read_parquet(path).reset_index()  # <- now needs to be: df = pd.read_parquet(path)
        assert (df.columns == ['page', 'lowercase', 'count']).all()

This now fails because running "reset_index" on a previously unindexed file creates a new column called "index" composed of the integers [0, 1, 2, ..., n].
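The failure mode in miniature (made-up data, same column names as the test):

```python
import pandas as pd

# read_parquet on an unindexed file returns the default RangeIndex;
# reset_index() then materializes it as a new integer "index" column.
df = pd.DataFrame({"page": [1, 2], "lowercase": ["a", "b"], "count": [3, 4]})
cols = list(df.reset_index().columns)
# cols is now ["index", "page", "lowercase", "count"], so the comparison
# against ["page", "lowercase", "count"] fails.
```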

Given that it's not part of the public API, it seems safe to change this. But I want to check, and see whether tests are needed anywhere else.

organisciak commented 3 years ago

Sounds good. Just update the saving methods to also not save the index, and the loading method to create the index after loading.

Peter Organisciak Assistant Professor, Research Methods and Information Science Morgridge College of Education | University of Denver

