Open · sebastian-schindler opened this issue 6 months ago
In the process of cleaning up the data loading, we switch from plain numpy arrays to pandas DataFrames. For this, the generic helper function `no_nan`, which removes NaN values, has to be able to handle DataFrames. This makes sense, as DataFrames and numpy arrays otherwise often work interchangeably, so they should work interchangeably here as well.
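A minimal sketch of what such a generic `no_nan` could look like (the function body here is an assumption; only the name and purpose come from the issue):

```python
import numpy as np
import pandas as pd

def no_nan(data):
    """Remove NaN entries from either a pandas DataFrame or a numpy array.

    Hypothetical sketch: DataFrames get rows with any NaN dropped;
    1-D arrays get NaN elements dropped; 2-D arrays get NaN rows dropped.
    """
    if isinstance(data, pd.DataFrame):
        return data.dropna()
    arr = np.asarray(data, dtype=float)
    if arr.ndim == 1:
        return arr[~np.isnan(arr)]
    # For 2-D arrays, drop every row that contains at least one NaN
    return arr[~np.isnan(arr).any(axis=1)]
```

Dispatching on the input type keeps the call sites unchanged, which is the point of making the helper generic.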
- `plot_highdim` removes all NaN values from the input data without notifying the user. Is this actually sensible to do?
- Extend `plot_highdim` to read labels from a pandas DataFrame automatically.
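If silent removal is not sensible, one option is to keep dropping NaNs but emit a warning with the count. A sketch under that assumption (not the current `plot_highdim` implementation):

```python
import warnings
import numpy as np

def drop_nan_with_notice(data):
    """Drop rows containing NaN, but tell the user how many were removed,
    instead of doing it silently. Hypothetical replacement for the
    NaN handling inside plot_highdim."""
    arr = np.asarray(data, dtype=float)
    keep = ~np.isnan(arr).any(axis=1)
    n_dropped = int((~keep).sum())
    if n_dropped:
        warnings.warn(f"Dropping {n_dropped} rows containing NaN values")
    return arr[keep]
```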
Instead of handling the final data in the same manner, it would be better to simply add the reconstructed obscuration information from the final data to the existing intermediary dataframe. Then a single dataframe can be used for everything, and most of the data is identical anyway: the final dataset is used only because the reconstructed obscuration is available only there. The final dataset contains a few hundred sources fewer than the intermediary dataset (a fact we cannot change), but it should be a subsample, as evident from this flowchart from the paper:

[flowchart from the paper]

To add the relevant obscuration column from final to intermediary, the tables can be cross-matched, e.g. using the AllWISE ID.
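That cross-match can be expressed as a left join on the AllWISE ID. A sketch, assuming the ID column is `ALLW_ID` as in the snippets below; the obscuration column name `NH_rec` is a placeholder:

```python
import pandas as pd

def add_obscuration(inter: pd.DataFrame, final: pd.DataFrame,
                    obsc_col: str = "NH_rec") -> pd.DataFrame:
    """Left-join the reconstructed obscuration column from the final
    dataset onto the intermediary dataframe via the AllWISE ID.
    `obsc_col` is a hypothetical name for the obscuration column."""
    return inter.merge(final[["ALLW_ID", obsc_col]],
                       on="ALLW_ID", how="left")
```

With `how="left"` every intermediary source is kept; sources missing from final simply get NaN in the obscuration column.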
However, problems arose during the cross-matching:
- `len(SnS_inter) - len(SnS_final)` = 271: this many sources are missing from final (okay)
- `len([x for x in SnS_inter['ALLW_ID'].to_numpy() if x not in SnS_final['ALLW_ID'].to_numpy()])` = 351: should also be 271, because if final is a subsample, the number of sources in inter but not in final must equal the length difference
- `len([x for x in SnS_final['ALLW_ID'].to_numpy() if x not in SnS_inter['ALLW_ID'].to_numpy()])` = 82: should be 0 if final were a subsample

Printing all AllWISE IDs, sorting them lexicographically and performing a diff shows that the problem does not (at least not entirely) come from artificial artifacts such as whitespace in the strings. This problem does not affect many sources, but it casts doubt on the entire dataset.
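The three checks above can be bundled into one diagnostic using set operations, which is also much faster than the quadratic list comprehensions. A sketch, assuming the same `ALLW_ID` key:

```python
import pandas as pd

def subsample_report(inter: pd.DataFrame, final: pd.DataFrame,
                     key: str = "ALLW_ID") -> dict:
    """Quantify how far `final` is from being a subsample of `inter`.
    If it were a true subsample, in_inter_only would equal len_diff
    and in_final_only would be 0."""
    ids_inter = set(inter[key])
    ids_final = set(final[key])
    return {
        "len_diff": len(inter) - len(final),
        "in_inter_only": len(ids_inter - ids_final),
        "in_final_only": len(ids_final - ids_inter),
    }
```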
Intermediary and final data are still treated independently for now. Both datasets now have a cleaned-up pandas dataframe with standardized (using `RobustScaler`) data columns as well, and are saved to pickle files, ready to be used by other notebooks.
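For reference, the standardization step centres each column on its median and scales by the interquartile range, which is what sklearn's `RobustScaler` computes; the sketch below does the same with plain numpy (column names and the pickle filename are placeholders):

```python
import numpy as np
import pandas as pd

def robust_scale(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Standardize the given columns: subtract the median, divide by the
    interquartile range (equivalent to sklearn's RobustScaler defaults)."""
    out = df.copy()
    for c in cols:
        q1, med, q3 = np.nanpercentile(df[c], [25, 50, 75])
        out[c] = (df[c] - med) / (q3 - q1)
    return out

# scaled.to_pickle("SnS_inter_scaled.pkl")  # hypothetical filename
```

Using median/IQR instead of mean/std keeps outliers (common in astronomical photometry) from dominating the scale.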
The oddity between intermediary and final data described above has not been dealt with yet, though.
Currently, the Jupyter notebooks are too long, which causes problems. Thus:

- In `SarahSofia_checks.ipynb`, separate the data loading and first-investigation part from the rest. Make the data loading a one-click thing (not 20 different cells to be executed) that can be used in other notebooks that investigate the data further.
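The "one-click" loading could be as simple as a small module-level function that other notebooks import, reading the pickle saved earlier (function name and path are placeholders):

```python
import pandas as pd

def load_data(pickle_path: str) -> pd.DataFrame:
    """One-click data loading: returns the cleaned-up dataframe that the
    long run of notebook cells used to produce. Other notebooks call
    this instead of re-executing ~20 cells."""
    return pd.read_pickle(pickle_path)
```

Putting this in a plain `.py` file next to the notebooks means `from data_loading import load_data` works from any notebook in the repository.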