sebastian-schindler / PhD


Clean up Jupyter notebooks #2

Open sebastian-schindler opened 6 months ago

sebastian-schindler commented 6 months ago

Currently, Jupyter notebooks are too long, which causes problems:

Thus:

sebastian-schindler commented 5 months ago

While cleaning up the data loading, I am switching from plain numpy arrays to pandas DataFrames. For this, the generic helper function no_nan, which removes NaN values, has to handle DataFrames as well. This makes sense, since DataFrames and numpy arrays are otherwise largely interchangeable, so no_nan should accept both.
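A minimal sketch of what such a helper could look like. The actual signature and semantics of no_nan are not shown in this issue, so the behavior here is an assumption: DataFrames and 2-D arrays lose every row that contains a NaN, and 1-D input loses the NaN elements.

```python
import numpy as np
import pandas as pd


def no_nan(data):
    """Return `data` without NaN entries (hypothetical reimplementation).

    Assumed behavior: pandas objects and 2-D arrays drop every row that
    contains at least one NaN; 1-D arrays drop the NaN elements.
    """
    if isinstance(data, (pd.DataFrame, pd.Series)):
        # dropna() keeps the original index labels of the surviving rows.
        return data.dropna()
    arr = np.asarray(data, dtype=float)
    if arr.ndim == 1:
        return arr[~np.isnan(arr)]
    # Keep only rows in which no column is NaN.
    return arr[~np.isnan(arr).any(axis=1)]


frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
clean_frame = no_nan(frame)             # DataFrame in, DataFrame out
clean_array = no_nan(frame.to_numpy())  # ndarray in, ndarray out
```

Dispatching on the input type keeps the return type equal to the input type, so existing numpy-based callers would not need to change.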

sebastian-schindler commented 5 months ago

Instead of handling the final data in the same manner, it would be better to add the reconstructed obscuration information from the final data to the existing intermediary dataframe. Then a single dataframe can be used for everything, and most of the data is identical anyway. The final dataset is used only because the reconstructed obscuration is available only there.

The final dataset contains a few hundred fewer sources than the intermediary dataset (a fact we cannot change), but it should be a subsample, as the flowchart from the paper shows: [image]

To add the relevant obscuration column from final to intermediary, the tables can be cross-matched, e.g. using the AllWISE ID.
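The cross-match could be sketched with a pandas left merge, as below. The column names (allwise_id, flux, obscuration) and the toy values are assumptions for illustration, not the real table schema.

```python
import pandas as pd

# Toy stand-ins for the two tables; names and values are hypothetical.
intermediary = pd.DataFrame({
    "allwise_id": ["J0001", "J0002", "J0003"],
    "flux": [1.2, 3.4, 5.6],
})
final = pd.DataFrame({
    "allwise_id": ["J0001", "J0003"],
    "obscuration": [0.1, 0.7],
})

# A left merge keeps every intermediary source; sources missing from the
# final table get NaN in the obscuration column.
merged = intermediary.merge(
    final[["allwise_id", "obscuration"]],
    on="allwise_id",
    how="left",
    validate="one_to_one",  # raises if an ID is duplicated in either table
    indicator=True,         # adds a "_merge" column recording the match status
)

# Intermediary sources that have no counterpart in the final table.
unmatched = merged[merged["_merge"] == "left_only"]
```

Using validate="one_to_one" makes duplicated IDs fail loudly instead of silently multiplying rows.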

However, problems arose during the cross-matching:

Printing all AllWISE IDs, sorting them lexicographically, and running a diff shows that the problem does not come, at least not entirely, from superficial artifacts such as whitespace in the strings. The problem affects only a few sources, but it casts doubt on the entire dataset.
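The same check can be sketched directly in Python instead of via printed lists and a diff. This is a stand-in for the procedure described above, not the code that was actually run; the function name and the example IDs are hypothetical.

```python
def compare_ids(intermediary_ids, final_ids):
    """Return the IDs unique to each table after stripping whitespace."""
    clean_inter = {str(i).strip() for i in intermediary_ids}
    clean_final = {str(i).strip() for i in final_ids}
    only_in_final = sorted(clean_final - clean_inter)
    only_in_intermediary = sorted(clean_inter - clean_final)
    return only_in_final, only_in_intermediary


# "J0001 " carries trailing whitespace and still matches "J0001".
only_in_final, only_in_intermediary = compare_ids(
    ["J0001 ", "J0002", "J0003"],
    ["J0001", "J0004"],
)
```

If the final data really is a subsample of the intermediary data, only_in_final should be empty; any entry left there after whitespace stripping points to a genuine mismatch.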

sebastian-schindler commented 5 months ago

Intermediary and final data are still treated independently for now. Each dataset now has a cleaned-up pandas dataframe that also includes standardized data columns (using RobustScaler), and both are saved to pickle files, ready to be used by other notebooks.
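A rough sketch of the standardize-and-save step, assuming scikit-learn's RobustScaler with default settings. The column names, values, and file name are placeholders, not the ones used in the notebooks.

```python
import os
import tempfile

import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical data columns; w1 contains one strong outlier.
data = pd.DataFrame({
    "w1": [1.0, 2.0, 3.0, 4.0, 100.0],
    "w2": [10.0, 20.0, 30.0, 40.0, 50.0],
})
columns = ["w1", "w2"]

# RobustScaler subtracts the median and divides by the interquartile range,
# so a single outlier barely influences the scaling.
scaled = data.copy()
scaled[columns] = RobustScaler().fit_transform(data[columns])

# Save the dataframe so other notebooks can load it without redoing the work.
path = os.path.join(tempfile.mkdtemp(), "intermediary.pkl")
scaled.to_pickle(path)
restored = pd.read_pickle(path)
```

Pickle preserves dtypes and the index exactly, which is convenient between notebooks in the same environment; it is not a safe format for files from untrusted sources.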

The previously described oddity between intermediary and final data has not been dealt with yet, though.