Diplotype clustering fails on sample sets without CNV calls

malariagen / malariagen-data-python

Analyse MalariaGEN data from Python

https://malariagen.github.io/malariagen-data-python/latest/

MIT License


Diplotype clustering fails on sample sets without CNV calls #555

Closed. sanjaynagi closed this issue 1 month ago

sanjaynagi commented 3 months ago

Diplotype clustering fails for this sample set because it has no CNV calls. Earlier versions of the diplotype clustering function had a try/except statement to get around this; we could implement something like that.

sanjaynagi commented 3 months ago

Temporary workaround: exclude the affected sample set when building the list of sample sets.

sample_sets = [s for s in ag3.sample_sets()['sample_set'].to_list() if s != 'barron-2019']

alimanfoo commented 3 months ago

Seems fine to handle this error internally, but how do we communicate to the user that the data are missing? Need to make sure it doesn't look like these samples have normal copy number.
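One way to address this concern, sketched here under assumptions rather than taken from the library: fill copy number with NaN for samples whose sample set has no CNV calls, and emit a warning that names the affected sample sets. The function `copy_number_rows` and its arguments are hypothetical and are not part of malariagen-data.

```python
import math
import warnings


def copy_number_rows(calls_by_sample_set, n_samples_by_set, n_genes):
    """Build one copy-number row per sample (hypothetical helper).

    calls_by_sample_set maps a sample set to its per-sample rows;
    a sample set that is absent from the mapping has no CNV calls.
    """
    rows = []
    missing = []
    for sample_set, n_samples in n_samples_by_set.items():
        calls = calls_by_sample_set.get(sample_set)
        if calls is None:
            missing.append(sample_set)
            # Use NaN, not 2, so that absent data cannot be read as
            # normal diploid copy number.
            rows.extend([[math.nan] * n_genes for _ in range(n_samples)])
        else:
            rows.extend(calls)
    if missing:
        # Tell the user explicitly which sample sets lack CNV calls.
        warnings.warn(
            "No CNV calls for sample sets: " + ", ".join(missing),
            UserWarning,
        )
    return rows, missing


# Usage: sample set "b" has two samples but no CNV calls.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    rows, missing = copy_number_rows({"a": [[2, 3]]}, {"a": 1, "b": 2}, n_genes=2)
```

A plot built from such rows could then render NaN cells in a distinct "no data" colour, which keeps them visually separate from samples with normal copy number.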

leehart commented 2 months ago

Would it be sufficient to simply skip sample sets that don't have CNV HMM data in the API?

Essentially:

# Fragment from inside the loop over sample sets (s).
y = self._cnv_hmm_dataset(
    contig=r.contig,
    sample_set=s,
    inline_array=inline_array,
    chunks=chunks,
)

# If no CNV HMM dataset was found, skip this sample set.
if y is None:
    continue

ly.append(y)

For example, I can submit a PR that would allow a call like this to succeed even when some of the selected sample sets lack CNV HMM data:

ag3.plot_diplotype_clustering_advanced(
  region='2L:28,535,000-28,552,000',
  cnv_region='2L:28,535,000-28,552,000',
  sample_query='taxon == "gambiae" and year > 2019',
  site_mask='gamb_colu_arab',
  color='taxon',
  snp_transcript='AGAP006227-RA',
)

[Image: newplot, the plot produced by the call above]

leehart commented 1 month ago

We now skip sample sets that don't have CNV HMM data, but still raise a ValueError when no CNV HMM data are found at all.
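The behaviour described here (skip what is missing, fail only when nothing is found) can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `collect_cnv_datasets` and `load_dataset` are hypothetical names, and the loader is assumed to return None for a sample set without CNV HMM data.

```python
def collect_cnv_datasets(sample_sets, load_dataset):
    """Collect datasets, skipping sample sets with no data (hypothetical helper).

    load_dataset is an assumed callable that returns a dataset,
    or None when the sample set has no CNV HMM data.
    """
    found = []
    skipped = []
    for s in sample_sets:
        y = load_dataset(s)
        if y is None:
            # No CNV HMM data for this sample set: record it and move on.
            skipped.append(s)
            continue
        found.append(y)
    if not found:
        # Nothing at all was found, so there is nothing to analyse.
        raise ValueError("No CNV HMM data found for the requested sample sets.")
    return found, skipped


# Usage with a stand-in data store; "set-b" has no CNV HMM data.
fake_store = {"set-a": "cnv-data-a", "set-c": "cnv-data-c"}
found, skipped = collect_cnv_datasets(["set-a", "set-b", "set-c"], fake_store.get)
```

Returning the skipped sample sets alongside the data leaves the caller free to report them, so that skipped samples are not mistaken for samples with normal copy number.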