instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0
12 stars 28 forks source link

Investigate Dataset.map() multiprocessing failure #123

Closed russellb closed 1 month ago

russellb commented 1 month ago

This issue is called out in one of the commits in PR #117

The second issue is specific to map():

ValueError: The features can't be aligned because the key score of features {'task_description': Value(dtype='string', id=None), 'seed_question': Value(dtype='string', id=None), 'seed_response': Value(dtype='string', id=None), 'num_samples': Value(dtype='int64', id=None), 'question': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None), 'evaluation': Value(dtype='string', id=None), 'score': Value(dtype='string', id=None)} has unexpected type - Value(dtype='string', id=None) (expected either Value(dtype='float64', id=None) or Value("null").

It appears the the datasets, only in the case of num_proc>1, when we hit the "error converting dtype" case and set the column to None, it ends up being still considered a string column rather than the new type.

This second issue deserves further investigation and may require a fix to the datasets library.

The related code in filterblock.py as of that PR is:

def _map_dtype(samples, column, dtype, num_proc=1):
    def convert_column(sample):
        try:
            sample[column] = dtype(sample[column])
        except ValueError as e:
            logger.error(
                "Error converting dtype: %s, filling with None to be filtered later", e
            )
            sample[column] = None
        return sample

    # FIXME: it appears multiprocessing map has issues with
    # None columns. If we pass num_proc>1 here and the error
    # case is triggered above, we get:
    #   ValueError: The features can't be aligned ...
    # because the column is still considered a string not
    # the new dtype.
    num_proc = 1

    return samples.map(convert_column, num_proc=num_proc)

We need to investigate this error more deeply to figure out the best fix

markmc commented 1 month ago

For reference, the type conversion failure handling case was introduced in #78

markmc commented 1 month ago

A tiny bit of searching led me to stuff like huggingface/datasets#3195 and huggingface/datasets#3676 - giving the impression that None is supposed to be handled, but it wouldn't surprise anyone if there were still latent bugs in None handling

markmc commented 1 month ago

From @russellb

Maybe we need a default value we set that's native to whatever we're trying to convert to. Otherwise right now we have a mix of things converted to a numeric type (in the case of the commit message) and None isn't valid for those.

Knowing what value we can use in this error case doesn't seem simple. It seems like it needs to be provided as configuration to the FilterBlock.

I did consider using a different default value than None - any value of that type that is not in filter_value would work, so we could randomly choose one? Ugh.

I also considered adding an "invalid_value" field, but adding something like that to the format to work around a bug? Ugh.

block_type: FilterByValueBlock
   block_config:
     block_name: filter_questions
     filter_column: score
     filter_value: 1.0
     invalid_value: 0
     operation: eq
     convert_dtype: float
russellb commented 1 month ago

This change looks related: https://github.com/aakankshaduggal/sdg/pull/7/files

It's just dropping the samples converted to None completely

markmc commented 1 month ago

This change looks related: https://github.com/aakankshaduggal/sdg/pull/7/files

It's just dropping the samples converted to None completely

From #110 I'm guessing it is to avoid:

TypeError: '>=' not supported between instances of 'NoneType' and 'float'

i.e. what we default to on error needs to be a valid value for any of the supported operators

I did consider using a different default value than None - any value of that type that is not in filter_value would work, so we could randomly choose one? Ugh.

So, no ... it's any value of that type which will cause the operator to return False

markmc commented 1 month ago

Pretty sure this is resolved by https://github.com/instructlab/sdg/pull/143