Open · sebastian-schindler opened this issue 6 months ago
In the process of cleaning up the data loading, we switch from plain numpy arrays to pandas DataFrames. For this, the generic helper function `no_nan`, which removes NaN values, has to be able to handle DataFrames. This makes sense, as DataFrames and numpy arrays otherwise often work interchangeably, so they should work interchangeably here as well.
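A minimal sketch of what such a generic `no_nan` could look like (the function body here is an assumption; only the name and purpose come from the issue):

```python
import numpy as np
import pandas as pd

def no_nan(data):
    """Remove NaN entries from either a pandas DataFrame or a numpy array.

    Hypothetical sketch: DataFrames get rows with any NaN dropped;
    1-D arrays get NaN elements dropped; 2-D arrays get NaN rows dropped.
    """
    if isinstance(data, pd.DataFrame):
        return data.dropna()
    arr = np.asarray(data, dtype=float)
    if arr.ndim == 1:
        return arr[~np.isnan(arr)]
    # For 2-D arrays, drop every row that contains at least one NaN
    return arr[~np.isnan(arr).any(axis=1)]
```

Dispatching on the input type keeps the call sites unchanged, which is the point of making the helper generic.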
- `plot_highdim` removes all NaN values from the input data without notifying the user. Is this actually sensible to do?
- Extend `plot_highdim` to read labels from a pandas DataFrame automatically.
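If silent removal is not sensible, one option is to keep dropping NaNs but emit a warning with the count. A sketch under that assumption (not the current `plot_highdim` implementation):

```python
import warnings
import numpy as np

def drop_nan_with_notice(data):
    """Drop rows containing NaN, but tell the user how many were removed,
    instead of doing it silently. Hypothetical replacement for the
    NaN handling inside plot_highdim."""
    arr = np.asarray(data, dtype=float)
    keep = ~np.isnan(arr).any(axis=1)
    n_dropped = int((~keep).sum())
    if n_dropped:
        warnings.warn(f"Dropping {n_dropped} rows containing NaN values")
    return arr[keep]
```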
Instead of handling the final data in the same manner, it would be better to simply add the reconstructed obscuration information from the final data to the existing intermediary dataframe. Then a single dataframe can be used for everything, and most of the data is identical anyway: the final dataset is used only because the reconstructed obscuration is available only there. The final dataset contains a few hundred sources fewer than the intermediary dataset (a fact we cannot change), but it should be a subsample, as evident from this flowchart from the paper:

[flowchart from the paper]

To add the relevant obscuration column from final to intermediary, the tables can be cross-matched, e.g. using the AllWISE ID.
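That cross-match can be expressed as a left join on the AllWISE ID. A sketch, assuming the ID column is `ALLW_ID` as in the snippets below; the obscuration column name `NH_rec` is a placeholder:

```python
import pandas as pd

def add_obscuration(inter: pd.DataFrame, final: pd.DataFrame,
                    obsc_col: str = "NH_rec") -> pd.DataFrame:
    """Left-join the reconstructed obscuration column from the final
    dataset onto the intermediary dataframe via the AllWISE ID.
    `obsc_col` is a hypothetical name for the obscuration column."""
    return inter.merge(final[["ALLW_ID", obsc_col]],
                       on="ALLW_ID", how="left")
```

With `how="left"` every intermediary source is kept; sources missing from final simply get NaN in the obscuration column.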
However, problems arose during the cross-matching:
- `len(SnS_inter) - len(SnS_final)` = 271: this many sources are missing from final (okay)
- `len([x for x in SnS_inter['ALLW_ID'].to_numpy() if x not in SnS_final['ALLW_ID'].to_numpy()])` = 351: should also be 271, because if final is a subsample, the number of sources in inter but not in final must equal the length difference
- `len([x for x in SnS_final['ALLW_ID'].to_numpy() if x not in SnS_inter['ALLW_ID'].to_numpy()])` = 82: should be 0 if final were a subsample

Printing all AllWISE IDs, sorting them lexicographically and performing a diff shows that the problem does not (at least not entirely) come from artificial artifacts such as whitespace in the strings. This problem does not affect many sources, but it casts doubt on the entire dataset.
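The three checks above can be bundled into one diagnostic using set operations, which is also much faster than the quadratic list comprehensions. A sketch, assuming the same `ALLW_ID` key:

```python
import pandas as pd

def subsample_report(inter: pd.DataFrame, final: pd.DataFrame,
                     key: str = "ALLW_ID") -> dict:
    """Quantify how far `final` is from being a subsample of `inter`.
    If it were a true subsample, in_inter_only would equal len_diff
    and in_final_only would be 0."""
    ids_inter = set(inter[key])
    ids_final = set(final[key])
    return {
        "len_diff": len(inter) - len(final),
        "in_inter_only": len(ids_inter - ids_final),
        "in_final_only": len(ids_final - ids_inter),
    }
```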
Intermediary and final data are still treated independently for now. Both datasets now have a cleaned-up pandas dataframe with standardized (using `RobustScaler`) data columns as well, and are saved to pickle files, ready to be used by other notebooks.
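For reference, the standardization step centres each column on its median and scales by the interquartile range, which is what sklearn's `RobustScaler` computes; the sketch below does the same with plain numpy (column names and the pickle filename are placeholders):

```python
import numpy as np
import pandas as pd

def robust_scale(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Standardize the given columns: subtract the median, divide by the
    interquartile range (equivalent to sklearn's RobustScaler defaults)."""
    out = df.copy()
    for c in cols:
        q1, med, q3 = np.nanpercentile(df[c], [25, 50, 75])
        out[c] = (df[c] - med) / (q3 - q1)
    return out

# scaled.to_pickle("SnS_inter_scaled.pkl")  # hypothetical filename
```

Using median/IQR instead of mean/std keeps outliers (common in astronomical photometry) from dominating the scale.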
The oddity between intermediary and final data described above has not been dealt with yet, though.
Currently, the Jupyter notebooks are too long, which causes problems. Thus:

- In `SarahSofia_checks.ipynb`, separate the data loading and first-investigation part from the rest. Make the data loading a one-click thing (not 20 different cells to be executed) that can be used in other notebooks that investigate the data further.
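The "one-click" loading could be as simple as a small module-level function that other notebooks import, reading the pickle saved earlier (function name and path are placeholders):

```python
import pandas as pd

def load_data(pickle_path: str) -> pd.DataFrame:
    """One-click data loading: returns the cleaned-up dataframe that the
    long run of notebook cells used to produce. Other notebooks call
    this instead of re-executing ~20 cells."""
    return pd.read_pickle(pickle_path)
```

Putting this in a plain `.py` file next to the notebooks means `from data_loading import load_data` works from any notebook in the repository.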