malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License
13 stars 23 forks source link

Add ENA run accessions function #496

Closed ahernank closed 5 months ago

ahernank commented 7 months ago

Adding a new function to load a data accession catalog. This follows the same logic as the function to access the SNP data catalog.

Currently, only ENA run accessions are retrieved but in the future, we might want to include other accessions.

codecov[bot] commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (6aef71a) 98.79% compared to head (8b7f16c) 98.79%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #496 +/- ## ======================================= Coverage 98.79% 98.79% ======================================= Files 35 35 Lines 3474 3484 +10 ======================================= + Hits 3432 3442 +10 Misses 42 42 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ahernank commented 7 months ago

Thanks @alimanfoo.

Could we rename the function to wgs_run_accessions() so it's a little more self-descriptive.

Yes, happy to. My reasoning with the original naming, was that at some point we might also want to store other accession types (e.g. sample, project accessions..).

Also not directly related to this PR but I'm noticing that the underlying CSV files include some columns which duplicate data from the general metadata (contributor, country, etc.). Would it be possible to strip back so these files just contain sample_id and run_ena columns?

My logic was to make these on the same way as the CSVs for wgs_snp_data, so if they were required as a solo file, they could be downloaded directly from GCS. Happy to strip down if cleaner, though.

alimanfoo commented 7 months ago

Could we rename the function to wgs_run_accessions() so it's a little more self-descriptive.

Yes, happy to. My reasoning with the original naming, was that at some point we might also want to store other accession types (e.g. sample, project accessions..).

Cool, I'd say scope it explicitly to run accessions, also helps to make clear the data should contain one row per run.

alimanfoo commented 7 months ago

Also not directly related to this PR but I'm noticing that the underlying CSV files include some columns which duplicate data from the general metadata (contributor, country, etc.). Would it be possible to strip back so these files just contain sample_id and run_ena columns?

My logic was to make these on the same way as the CSVs for wgs_snp_data, so if they were required as a solo file, they could be downloaded directly from GCS. Happy to strip down if cleaner, though.

Ah OK, I didn't realise we were also baking these extra fields into the wgs_snp_data files.

I'd say strip it back to only columns not present in other metadata files. That way it's simpler to handle if we ever need to fix metadata, corrections only need to be applied in one place.

ahernank commented 7 months ago

I'd say strip it back to only columns not present in other metadata files. That way it's simpler to handle if we ever need to fix metadata, corrections only need to be applied in one place.

Ahh, good point! Would it be okay to update to strip the columns for wgs_snp_data too (+ update the generation of this file)? Just realised that my manual months fix is actually incomplete as it I didn't include these files.

alimanfoo commented 7 months ago

I'd say strip it back to only columns not present in other metadata files. That way it's simpler to handle if we ever need to fix metadata, corrections only need to be applied in one place.

Ahh, good point! Would it be okay to update to strip the columns for wgs_snp_data too (+ update the generation of this file)?

Yeah definitely, sounds good.