Error with assert_submission : "ValueError: Input contains NaN, infinity or a value too large for dtype('float32')."

Hello,

A strange bug occurs on the python4DS Airplanes prediction challenge and prevents us from making submissions. Our submission has the same structure as the use_external_data example submission, namely:

a function that merges external data, joining a single external_data.csv on 3 keys: DateOfDeparture, Departure and Arrival;
a date-processing function that creates the necessary date features;
and, within the pipeline, an OrdinalEncoder and a SimpleImputer.
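
For readers who want to reproduce the setup, a merge step of this kind could be sketched roughly as follows. This is an illustration only, not the actual submission code: the function name `merge_external_data`, the `max_temp` column and the sample rows are hypothetical, and the date normalisation is an assumption about how the keys might need to be aligned.

```python
import pandas as pd

KEYS = ["DateOfDeparture", "Departure", "Arrival"]


def merge_external_data(X, external):
    """Left-join external features onto X using the three keys."""
    X = X.copy()
    external = external.copy()
    # Assumption: the date key may be a string on one side and a datetime
    # on the other, so both sides are converted to datetime before merging.
    X["DateOfDeparture"] = pd.to_datetime(X["DateOfDeparture"])
    external["DateOfDeparture"] = pd.to_datetime(external["DateOfDeparture"])
    # validate="m:1" raises if external contains duplicated key rows,
    # which would otherwise silently multiply rows of X.
    return X.merge(external, how="left", on=KEYS, validate="m:1")


# Hypothetical sample data, only to show the call.
X = pd.DataFrame({
    "DateOfDeparture": ["2012-06-19", "2012-09-10"],
    "Departure": ["ORD", "LAS"],
    "Arrival": ["DFW", "DEN"],
})
external = pd.DataFrame({
    "DateOfDeparture": pd.to_datetime(["2012-06-19", "2012-09-10"]),
    "Departure": ["ORD", "LAS"],
    "Arrival": ["DFW", "DEN"],
    "max_temp": [34.0, 29.0],  # hypothetical external feature
})
merged = merge_external_data(X, external)
print(merged.isna().sum())
```

A left join keeps every row of X, so any key combination missing from the external file shows up as NaN in the external columns rather than as a dropped row.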

If we remove the merging function, everything works fine, but then we obviously cannot use external data.

When we build the pipeline in a notebook, everything works fine and we can use the cross_val_score function to evaluate our estimator. However, ramp-test fails with the error above. Our external_data doesn't contain any NaNs, and as I said, the merge works flawlessly in the notebook (the merged dataframe contains 8092 rows with no NaNs).

We implemented a temporary fix which simply joins the two datasets (which is risky because of potentially erroneous joins). ramp-test works with this version and returns the final statistics on the submission. However, this version still fails when submitted to ramp-studio.

We managed to print some intermediate results, and it appears that the merge works for the first two rounds of CV fold 0 before failing unexpectedly, with all columns coming from external_data containing NaNs (as can be seen in the attached screenshot).
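
One way to narrow this down might be to list, for each fold, the rows whose keys find no match in the external file. The sketch below uses the `indicator=True` option of pandas `merge`; the helper name `report_unmatched` and the sample rows are hypothetical, and it assumes the NaNs come from unmatched keys rather than from the external file itself.

```python
import pandas as pd

KEYS = ["DateOfDeparture", "Departure", "Arrival"]


def report_unmatched(X, external, keys=KEYS):
    """Return the key columns of rows in X that have no match in external."""
    check = X.merge(
        external[keys].drop_duplicates(),
        how="left",
        on=keys,
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    return check.loc[check["_merge"] == "left_only", keys]


# Hypothetical fold and external data: the second row has no counterpart.
X_fold = pd.DataFrame({
    "DateOfDeparture": ["2012-06-19", "2012-09-10"],
    "Departure": ["ORD", "LAS"],
    "Arrival": ["DFW", "DEN"],
})
external = pd.DataFrame({
    "DateOfDeparture": ["2012-06-19"],
    "Departure": ["ORD"],
    "Arrival": ["DFW"],
})

unmatched = report_unmatched(X_fold, external)
print(unmatched)
# Comparing key dtypes on both sides can reveal a string/datetime mismatch.
print(X_fold[KEYS].dtypes)
print(external[KEYS].dtypes)
```

If every row of a fold is reported as unmatched, a difference in the type or format of a key column between the two sides would be a plausible thing to check first.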

We are completely stuck on this issue and have spent more than 10 hours trying to fix it, without finding what we are doing wrong in our submission.

We have attached an environment.yml file to help reproduce the issue, as I suspect it could be linked to the pandas or scikit-learn version. NaN issue.zip
