Question about CSV output

lcoombe commented 8 months ago

Hello,

Thanks again for your support!

I had a question about the output CSV format. With the HG00096 test data as an example, I was hoping to understand how to parse out the predicted ancestry from the CSV file, for example the 1kGP continental prediction.

This is the full CSV file I get:

,gnomAD_continental,gnomAD_eur,gnomAD_eas,1kGP_amr,1kGP_afr,1kGP_eas,1kGP_eur,1kGP_sas,1kGP_continental,SGDP_continental,total
HG00096,"[1.5122607237572083e-06, 1.8164928405894898e-06, 0.0009023747406899929, 3.244014976644394e-07, 0.9990336894989014, 6.034681427991018e-05]","[0.00805113 0.23736083 0.03074179 0.16823333 0.28606283 0.1723055
 0.09627828]",[3.68147158e-09 1.32139044e-07 1.88580978e-07],[0.53877832 0.00880409 0.00346894 0.00426701],"[3.00100449e-02 9.35040662e-06 5.62040194e-06 2.39125970e-06
 8.74307418e-07 7.38177234e-06 3.08565700e-01]",[0.00022368 0.00019657 0.00612237 0.00238336 0.0005246 ],[0.00024373 0.00114689 0.04632446 0.00023648],"[3.94933998e-02 9.17796740e-05 2.62362570e-04 7.24252364e-05
 8.75818598e-03]","[0.047951564379904596, 0.00945056338673766, 0.5553183556463094, 0.048678153274968626, 0.3386013633120798]","[0.46249270157454775, 0.017400954019737333, 0.036047890028784516, 0.05911003287218582, 0.2102181048640299, 0.06134952484458104, 0.15338079179613362]","[(['asj', 'eur'], [0.0009023747406899929, 0.9990336894989014]), (['nfe_bgr', 'nfe_onf'], [0.2373608311507276, 0.28606282852166487]), (['eas_kor', 'eas_oea'], [1.321390441822575e-07, 1.8858097797694505e-07]), (['Colombian in Medellin, Colombia', 'Puerto Rican in Puerto Rico'], [0.00880409001735807, 0.538778316364331]), (['African Caribbean in Barbados', 'African Ancestry in Southwest US'], [0.03001004487393047, 0.30856570029013597]), (['Han Chinese in Bejing, China', 'Kinh in Ho Chi Minh City, Vietnam'], [0.0023833555217097533, 0.0061223672763285625]), (['Finnish in Finland', 'Iberian populations in Spain'], [0.0011468928779831196, 0.04632445558650182]), (['Gujarati Indian in Houston,TX', 'Punjabi in Lahore,Pakistan'], [0.008758185979881379, 0.039493399814332544]), (['afr', 'amr'], [0.3386013633120798, 0.5553183556463094]), (['WestEurasia', 'Africa'], [0.2102181048640299, 0.46249270157454775])]"

Parsing the file, I see this for the 1kGP_continental probabilities:

"[0.047951564379904596, 0.00945056338673766, 0.5553183556463094, 0.048678153274968626, 0.3386013633120798]"

However, from this file alone I was unsure which probability refers to which super-population. From the PDF plots, it looks like 0.56 refers to "AMR" and 0.34 refers to "AFR", but I was having trouble seeing how to parse that info from the CSV alone.

Thank you! Lauren

andreirajkovic commented 8 months ago

Yes, the output is a bit weird, and I apologize for the format; it could do with a bit of a rewrite. Here's how I might approach it. The example below takes the total column, which includes the top 2 recorded predictions from each model, and turns it into a dataframe with Ancestry, Probability, and Model columns.

import ast

import pandas as pd

# Read in the csv
df = pd.read_csv("HG00096_extracted.vcf_HG00096.csv")

models = [
    "gnomAD_continental",
    "gnomAD_eur",
    "gnomAD_eas",
    "1kGP_amr",
    "1kGP_afr",
    "1kGP_eas",
    "1kGP_eur",
    "1kGP_sas",
    "1kGP_continental",
    "SGDP_continental"
]

# Function to parse the 'total' column and include models
def explode_total_column_with_models(row):
    items = ast.literal_eval(row)
    result = []
    for model, (ancestries, probabilities) in zip(models, items):
        for ancestry, prob in zip(ancestries, probabilities):
            result.append((ancestry, prob, model))
    return result

# Apply the function and create a list of tuples (ancestry, probability, model)
expanded_data_with_models = df['total'].apply(explode_total_column_with_models).explode().tolist()

# Create a new DataFrame from the expanded data
expanded_df_with_models = pd.DataFrame(expanded_data_with_models, columns=['Ancestry', 'Probability', 'Model'])

# Display the expanded DataFrame with models
print(expanded_df_with_models)
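
To pull out just the 1kGP continental call, the same parsing idea can be narrowed to one model. The following is a minimal sketch, not part of snvstory: the function name `top_predictions` is hypothetical, the `total` cell is abbreviated to its last two entries from the HG00096 row above, and it assumes (as the snippet above does) that the entries in `total` follow the order of the model columns.

```python
import ast

# Abbreviated copy of the 'total' cell from the HG00096 row above
# (only the last two entries are kept, for brevity).
total_cell = (
    "[(['afr', 'amr'], [0.3386013633120798, 0.5553183556463094]), "
    "(['WestEurasia', 'Africa'], [0.2102181048640299, 0.46249270157454775])]"
)
# Model names for those two entries.
# Assumption: entries in 'total' are in the same order as the CSV model columns.
model_names = ["1kGP_continental", "SGDP_continental"]


def top_predictions(cell, names):
    """Map each model name to its recorded labels, highest probability first."""
    parsed = ast.literal_eval(cell)
    out = {}
    for name, (labels, probs) in zip(names, parsed):
        out[name] = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return out


predictions = top_predictions(total_cell, model_names)
best_label, best_prob = predictions["1kGP_continental"][0]
print(best_label, round(best_prob, 2))  # amr 0.56
```

This agrees with the PDF plot reading mentioned in the question (0.56 for AMR, 0.34 for AFR). Note that `total` only records the top 2 labels per model, so it does not label the remaining probabilities in the per-model columns.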

lcoombe commented 8 months ago

Thanks so much @andreirajkovic! That's super helpful!

A follow-up question: after getting a successful run with your provided HG00096 VCF, I moved on to using a VCF generated from a different 1000 Genomes individual. The strange thing is that the results look identical between the HG00096 test and this new individual, which seems wrong to me.

Any idea what could have gone wrong? This is my command:

/usr/bin/time -pv singularity exec -B $PWD:$PWD,/path/to/data/resource_dir/:/path/to/data/resource_dir/ --env PYTHONPATH=/opt/Ancestry/:$PYTHONPATH --env TMP_DIR=$PWD/tmp_snv /path/to/data/bin/snvstory/3.0.1/snvstory_3.0.1.sif python3 -m igm_churchill_ancestry --path ERR3242326_chr21_gatk.vcf --resource /path/to/data/data/resource_dir/ --genome-ver 38 --mode WGS --output-dir /path/to/work/dir/snv_story

The snv_story directory specified by --output-dir was not created, but I found the outputs in tmp_snv/137748c7ce844513b8c18241c281346a/output/

Here are the PDFs generated:

andreirajkovic commented 8 months ago

@lcoombe if you could upload the VCF here I could take a look, but my guess is that you're using a single chromosome (ERR3242326_chr21_gatk.vcf) and the model will kinda freak out because most of the features are just a bunch of zeros. A VCF that contains variants sampled across the whole genome should give a much different view.

lcoombe commented 8 months ago

Ah ok! I was just testing to make sure my own generated VCF worked OK, but that makes sense. Here's the VCF: ERR3242326_chr21_gatk.vcf.gz

But, I will generate the full genome VCF and see if that works better!

Thanks for your help! Lauren

lcoombe commented 8 months ago

An update - I generated the whole genome VCF, and I am strangely still getting identical results? (For the 1kGP predictions)

In case it helps, here's the full VCF I'm using: https://www.bcgsc.ca/downloads/btl/lcoombe/ERR3242326_gatk.vcf.gz (It was too big to attach here)

My command was the same as before - just using the full VCF instead.

Here's the PDF generated - looking the same as the other two from before:

Any other insights you might have would be much appreciated!

andreirajkovic commented 8 months ago

Based on my first crack at this, it looks like very few variants intersect between our feature set, ~55 (gnomAD dataset), and your supplied VCF. I'll keep looking into this to see if there is a mistake somewhere on our side, e.g. the features in the resource folder we uploaded are the wrong version, etc.

lcoombe commented 8 months ago

Ok great, thank you! Out of curiosity - do you get the same results with the HG00096 example VCF as I got above? Just thinking that could help narrow down if it's something with how I'm running snvstory vs. an issue with my sample's VCF itself?

andreirajkovic commented 8 months ago

That's a good question. The HG00096 example VCF has only 1000 random variants, so none of them intersect with the ancestry-specific variants. I did sample 600,000 variants and ran into the same issue. So now I need to do some deeper digging to figure out what is going on.

andreirajkovic commented 8 months ago

Okay, I think I figured it out. There is an issue with how we handle chr in front of the chromosome number, e.g. chr1 vs 1. From what I recall, the chr prefix was for hg38 alignments, but this doesn't seem to always be the case. A quick fix for you would be to add chr to the chromosome name on each record line of your VCF; see the bash code below. Also, this is what I get when I compute your sample:

ERR3242326.pdf

zcat ERR3242326_gatk.vcf.gz | awk 'BEGIN{FS=OFS="\t"} !/^#/{$1="chr"$1}1' | gzip > modified_file_ERR3242326.vcf.gz
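
Before rewriting a VCF, it may help to check which chromosome naming style it actually uses. The following is a minimal sketch in Python, not part of snvstory; the function names `contig_prefix_counts` and `contig_prefix_counts_from_file` are hypothetical, and the example records are made up for illustration.

```python
import gzip
from collections import Counter


def contig_prefix_counts(lines):
    """Count data lines whose CHROM field does / does not start with 'chr'."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom = line.split("\t", 1)[0]
        counts["chr" if chrom.startswith("chr") else "no_chr"] += 1
    return counts


def contig_prefix_counts_from_file(path):
    """Same check for a plain or gzip-compressed VCF file on disk."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return contig_prefix_counts(handle)


# Tiny made-up example (not real data): two records without the prefix, one with it.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t10177\t.\tA\tAC\t.\tPASS\t.\n",
    "21\t9411239\t.\tG\tA\t.\tPASS\t.\n",
    "chr21\t9411245\t.\tC\tA\t.\tPASS\t.\n",
]
print(contig_prefix_counts(example))
```

If every record falls under `no_chr`, the awk one-liner above is the relevant workaround; if every record already has the prefix, running it would produce names like `chrchr1`, so the check is worth doing first.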

lcoombe commented 8 months ago

Ah ok! I'll give that a shot - what you're getting is somewhat what I'd expect for that sample. So hopefully if I make that formatting change, I'll get the same. I'll let you know - thank you!

lcoombe commented 8 months ago

That did the trick! With that renaming, I'm getting the same output as you, which is consistent with the sample's 1000 genomes ancestry label.

Thanks again for your help! Lauren

nch-igm / snvstory

Question about CSV output #11