paris-saclay-cds / ramp-workflow

Toolkit for building predictive workflows on top of pydata (pandas, scikit-learn, pytorch, keras, etc.).
https://paris-saclay-cds.github.io/ramp-docs/
BSD 3-Clause "New" or "Revised" License
68 stars 43 forks

Error with assert_submission : "ValueError: Input contains NaN, infinity or a value too large for dtype('float32')." #259

Closed louistransfer closed 3 years ago

louistransfer commented 3 years ago

Hello,

A strange bug occurs on the python4DS Airplanes prediction challenge that prevents us from making submissions. We have the same structure as the use_external_data submission example, which is to say:

When we remove the merging function, everything works fine, but it obviously becomes impossible to use external data.

When we build the pipeline in a notebook, everything works fine and we can use the cross_val_score function to evaluate our estimator. However, ramp-test fails and returns this error. Our external_data doesn't contain any NaNs, and as I said, the merging works flawlessly in the notebook (the merged dataframe contains 8092 rows with no NaNs).

We implemented a temporary fix which simply joins the two datasets (which is risky in terms of potential erroneous joins). ramp-test works with this version and returns final statistics on the submission. However, this version fails when posted on ramp-studio.
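To illustrate the risk of erroneous joins mentioned above, here is a minimal sketch. It assumes the temporary fix aligns rows by index with `DataFrame.join`, which the post does not confirm; the column names `date` and `temp` and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the rows of X are not in the same order as external.
X = pd.DataFrame({"date": ["2012-01-02", "2012-01-01"]})
external = pd.DataFrame(
    {"date": ["2012-01-01", "2012-01-02"], "temp": [1.0, 2.0]}
)

# Index-based join: rows are paired by index label, ignoring the date key.
by_index = X.join(external[["temp"]])

# Key-based left merge: rows are paired on the shared "date" column,
# and the row order of X is preserved.
by_key = X.merge(external, how="left", on="date")

print(by_index)  # 2012-01-02 is silently paired with the temp of 2012-01-01
print(by_key)    # each date gets its own temp
```

Neither call raises an error, so a mismatch of this kind only shows up when the joined values are compared against the keys.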

We managed to print some results, and it appears that the merge works for the first two rounds of CV 0 before failing unexpectedly, with all columns from external_data containing NaNs (as can be seen in the attached screenshot).
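One way to see which rows pick up NaNs during a merge is pandas' `indicator=True` option, which flags rows of the left frame that found no match in the right frame. The sketch below is only a diagnostic under assumptions: the helper name `merge_with_report`, the key columns `DateOfDeparture` and `Arrival`, and the toy data are hypothetical and are not taken from the submission.

```python
import pandas as pd


def merge_with_report(X, external, on):
    """Left-merge X with external and report the keys that found no match."""
    merged = X.merge(external, how="left", on=on, indicator=True)
    # Rows tagged "left_only" exist in X but not in external, so every
    # column coming from external is NaN for them.
    unmatched = merged[merged["_merge"] == "left_only"]
    return merged.drop(columns="_merge"), unmatched[on].drop_duplicates()


# Hypothetical toy data: the LAX row has no counterpart in external.
X = pd.DataFrame(
    {
        "DateOfDeparture": pd.to_datetime(
            ["2012-01-01", "2012-01-02", "2012-01-03"]
        ),
        "Arrival": ["ORD", "ORD", "LAX"],
    }
)
external = pd.DataFrame(
    {
        "DateOfDeparture": pd.to_datetime(["2012-01-01", "2012-01-02"]),
        "Arrival": ["ORD", "ORD"],
        "temp": [1.0, 2.0],
    }
)

merged, missing_keys = merge_with_report(
    X, external, on=["DateOfDeparture", "Arrival"]
)
print(merged["temp"].isna().sum())  # number of rows without external data
print(missing_keys)                 # the key combinations that did not match
```

Printing `missing_keys` together with `X[on].dtypes` and `external[on].dtypes` inside the transformer would show whether the keys differ in value or in type between the notebook and the ramp-test run.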

We are completely stuck on this issue and have spent more than 10 hours trying to fix it without finding what we are doing wrong in our submission.

We attached an environment.yml file to help replicate the issue, as I guess it could be linked to a version issue with pandas or scikit-learn.

NaN issue.zip

heliabrull commented 3 years ago

We encountered the exact same issue in the python4DS Airplanes prediction challenge: our external data contains no NaN values, the merging works fine (8092 rows with no NaNs), and the pipeline works perfectly, including when we use the cross_val_score function to evaluate our estimator. When submitting in RAMP, however, the error "Input contains NaN" is raised.