jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Record HMM details for every sample #176

Closed: jeromekelleher closed this issue 3 months ago

jeromekelleher commented 1 year ago

Keep the list of mutations and the original copying node for every sample as QC metadata. It gets too hard to reason about this after the fact, when layers of tree building are placed on top.

szhan commented 3 months ago

This is a working example:

{
    "Artic_primer_version": ".",
    "Collection_date": "2020-01-01",
    "Country": "India",
    "Date_tree_order": "Date",
    "Experiment": "SRX10971427",
    "Genbank_N": ".",
    "Genbank_accession": ".",
    "Genbank_other_runs": ".",
    "Genbank_pangolin": ".",
    "Genbank_scorpio": ".",
    "Genbank_tree_name": ".",
    "In_Viridian_tree": "T",
    "In_intersection": "F",
    "Platform": "ION_TORRENT",
    "Region": "Vadodara, Gujarat",
    "Sample": "SAMN19315526",
    "Study": "PRJNA625669",
    "Viridian_N": 3,
    "Viridian_amplicon_scheme": "COVID-AMPLISEQ-V1",
    "Viridian_cons_het": 0,
    "Viridian_cons_len": 29832,
    "Viridian_pangolin": "B.1.36.29",
    "Viridian_result": "PASS",
    "Viridian_scorpio": ".",
    "date": "2020-01-01",
    "date_submitted": "2021-05-25",
    "sc2ts": {
        "mutations": [
            "222C>T",
            "241C>T",
            "3037C>T",
            "3267C>T",
            "4683C>T",
            "5986C>T",
            "6471C>T",
            "9870C>T",
            "10486A>G",
            "14408C>T",
            "14652T>C",
            "18877C>T",
            "21034C>T",
            "22444C>T",
            "22627G>A",
            "22882T>G",
            "23403A>G",
            "25563G>T",
            "26173G>T",
            "26735C>T",
            "27147G>T",
            "28183G>T",
            "28277T>C",
            "28854C>T"
        ],
        "path": [
            "parent=1:left=0-right=29904"
        ],
        "qc": {
            "num_masked_sites": 130,
            "original_base_composition": {
                "A": 8894,
                "C": 5466,
                "G": 5850,
                "N": 106,
                "T": 9587
            },
            "original_md5": "587f101275d647fd492027b0855e893f"
        }
    },
    "strain": "SRR14631544"
}

szhan commented 3 months ago

We should also include whether a mutation is a reversion or immediate reversion, I think.

Also, there is probably a prettier way to show the copying paths than the format "parent=1:left=0-right=29904".

jeromekelleher commented 3 months ago

I'd use a different format for the path all right; that one would be awkward. It would probably be as well to store the mutations as objects too. Let's not worry about the storage too much.

szhan commented 3 months ago

Do you mean converting the MatchMutation and PathSegment objects to some form that the metadata encoder is okay with? I am trying this string repr approach because I couldn't just pass a list of MatchMutation and PathSegment objects to the metadata JSON.

jeromekelleher commented 3 months ago

If you add an asdict function to them you should be able to output to JSON easily. Look for other asdict implementations.
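
For illustration, the asdict suggestion might look roughly like the sketch below. The two classes are hypothetical stand-ins rather than the actual sc2ts definitions: they are assumed here to be dataclasses, with field names copied from the metadata shown in this thread, and `hmm_metadata` is a made-up helper name.

```python
import dataclasses
import json


# Hypothetical stand-in for the real PathSegment class; the fields
# mirror the "path" entries shown in this thread.
@dataclasses.dataclass(frozen=True)
class PathSegment:
    left: int
    right: int
    parent: int

    def asdict(self):
        # Plain dict of built-in types, which the JSON encoder accepts.
        return dataclasses.asdict(self)


# Hypothetical stand-in for the real MatchMutation class.
@dataclasses.dataclass(frozen=True)
class MatchMutation:
    site_id: int
    site_position: int
    derived_state: str
    inherited_state: str
    is_reversion: bool
    is_immediate_reversion: bool

    def asdict(self):
        return dataclasses.asdict(self)


def hmm_metadata(path, mutations):
    # Hypothetical helper: build the JSON-ready HMM section of the metadata.
    return {
        "path": [segment.asdict() for segment in path],
        "mutations": [mutation.asdict() for mutation in mutations],
    }


path = [PathSegment(left=0, right=29904, parent=1)]
mutations = [
    MatchMutation(
        site_id=144,
        site_position=222,
        derived_state="T",
        inherited_state="C",
        is_reversion=False,
        is_immediate_reversion=False,
    )
]

md = hmm_metadata(path, mutations)
# json.dumps(path) would raise TypeError because the encoder does not
# know how to serialise the objects; the converted dicts are fine.
encoded = json.dumps({"sc2ts": md}, indent=4, sort_keys=True)
print(encoded)
```

Converting at the boundary like this keeps the objects themselves unchanged and leaves the choice of storage format to the metadata encoder.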

szhan commented 3 months ago

It looks like this now.

"sc2ts": {
    "mutations": [
        {
            "derived_state": "T",
            "inherited_state": "C",
            "is_immediate_reversion": false,
            "is_reversion": false,
            "site_id": 144,
            "site_position": 222
        },
        ...
        {
            "derived_state": "T",
            "inherited_state": "C",
            "is_immediate_reversion": false,
            "is_reversion": false,
            "site_id": 28472,
            "site_position": 28854
        }
    ],
    "path": [
        {
            "left": 0,
            "parent": 1,
            "right": 29904
        }
    ],

szhan commented 3 months ago

I thought I had to modify a test in test_inference.py, but the tests ran fine without it.