lczech / grenedalf

Toolkit for Population Genetic Statistics from Pool-Sequenced Samples, e.g., in Evolve and Resequence experiments
GNU General Public License v3.0
33 stars 2 forks source link

error regarding vcf file from PoolSNP #8

Closed JennyCNS closed 3 months ago

JennyCNS commented 1 year ago

Hello,

I am trying to run grenedalf frequency on my vcf file generated by PoolSNP from an mpileup file using

GRENEDALF=/gxfs_home/geomar/smomw573/software/grenedalf/bin/grenedalf
VCF=/gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/PoolSNP/finalfile.vcf.gz
GENOME=/gxfs_work1/geomar/smomw573/seasonal_adaptation/genome/Eaffinis.Baltic.PseudoRef.Mar22.fasta \
OUTPUT=/gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/grenedalf-freq

$GRENEDALF frequency --vcf-path $VCF --allow-file-overwriting --reference-genome-fasta-file $GENOME --write-sample-alt-freq > $OUTPUT/grenedalftrial.log

and got the following error:

Cannot iterate over VCF file /gxfs_home/geomar/smomw573/work/seasonal_adaptation/analysis/PoolSNP/finalfile.vcf using the "AD" FORMAT field to count allelic depths, as that field is not part of the VCF file.

PoolSNP output vcf head

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  EA_2009_T1      EA_2009_T2      EA_2009_T3      EA_2009_T4      EA_2011_T1      EA_2011_T2    EA_2015_T1      EA_2015_T2      EA_2015_T3      EA_2015_T4      EA_2022_T1      EA_2022_T2      EA_2022_T3      EA_2022_T4
TRINITY_DN17_c1_g1_i1   185     .       G       A       .       .       ADP=82.42857142857143;NC=0      GT:RD:AD:DP:FREQ        ./.:.:.:.:.     0/1:51:1:52:0.02      0/1:71:3:74:0.04        0/1:94:3:97:0.03        0/1:64:2:66:0.03        0/1:107:2:109:0.02      0/1:109:2:111:0.02      0/0:91:0:91:0.0 0/1:48:2:50:0.04      0/0:96:0:96:0.0 0/1:99:2:101:0.02       0/1:92:1:93:0.01        0/1:79:3:82:0.04        0/1:100:2:102:0.02
TRINITY_DN17_c1_g1_i1   190     .       C       G       .       .       ADP=83.92857142857143;NC=0      GT:RD:AD:DP:FREQ        ./.:.:.:.:.     0/1:48:3:51:0.06      0/1:73:4:77:0.05        0/1:96:4:100:0.04       0/1:67:1:68:0.01        0/1:105:5:110:0.05      0/1:108:2:110:0.02      0/1:92:3:95:0.03        0/1:48:2:50:0.04      0/1:94:4:98:0.04        0/1:104:2:106:0.02      0/1:87:4:91:0.04        0/1:82:3:85:0.04        0/0:102:0:102:0.0

The vcf file was then formated with the following python script:

# Open the VCF file in read mode
with gzip.open(vcf_file_path, 'rt', encoding='utf-8') as vcf_file:
    # Iterate through each line in the file
    with open(output_file_path, 'a', encoding='utf-8') as output_file:
        line_number = 0

        for line in vcf_file:
            if line.startswith('#'):
                continue

            # Increment the line number (Move this line outside the loop)
            # line_number += 1
            # Split the line into fields based on tabs
            fields = line.strip().split('\t')
            # 9th column forward
            fsplit = fields[8].split(':')
            modified_fields = ':'.join([fsplit[0], fsplit[2], fsplit[3], fsplit[4]])
            fields[8] = modified_fields

            idx = 9
            for fcolumn in fields[9:]:
                tmpsplit = fcolumn.split(':')
                tmpjoin = ':'.join([tmpsplit[0], ','.join([tmpsplit[1], tmpsplit[2]]), ':'.join([tmpsplit[3], tmpsplit[4]])])
                fields[idx] = tmpjoin
                idx += 1

            # Write the modified line to the output file
            output_file.write('\t'.join(fields) + '\n')

            line_number += 1
            #print(line_number)

            # Print line number every 50,000 lines
            if line_number % 50000 == 0:
                print(f"Processed {line_number} lines.")

and we corrected the AD field in the vcf header

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

and now grenedalf frequency in running.

Many thanks,

Jennifer

lczech commented 1 year ago

Hi Jennifer,

thanks for reporting this. As far as I can tell, you already solved this?! Great!

However, I do have some questions for clarification. Also, for future reference, and maybe to improve the situation for other users, I'll try to recap your approach. The file snippets that you posted look like the following is happening:

However, as far as I can tell, the AD FORMAT field is non-standard - it does not appear in the VCF specification. When implementing this in grenedalf, I've hence tried to find out what people mostly do, see here and here for instance. That second link though seems to tell us that that DP is not exactly ref + alt allele count - it might differ depending on filter settings... But, the way it's implemented in PoolSNP, it's the sum, so we are good here. Also, given that AD is non-standard, it can be used by each program as they see fit - and PoolSNP seems to report it in a non-typical way, as a single number, and you changed that back to the more typical way of using that field.

Is my interpretation of your approach correct? Also, do you have any suggestions on how do improve the situation? If time permits, I could implement the PoolSNP way of using the AD field in grenedalf as well, if that helps. Maybe @capoony (the author of PoolSNP) wants to comment on this for clarification as well? In particular to check if I understood the PoolSNP VCF format correctly.

Cheers, thanks, and so long Lucas

lczech commented 1 year ago

Oh also, are you planning to run any other grenedalf tools, other than the frequency command? Because that one literally only gives you the information that you already have there from the RD and AD fields, so you could as well compute that with your script there directly ;-)

lczech commented 1 year ago

Hey @JennyCNS,

any news on this? I am not sure what I can do to help with this - would it help to be more lenient with respect to the requirement of the "AD" FORMAT field being present in the VCF header? Maybe just warn about it, but as long as the actual lines in the file contain an AD annotation, it should still work? What do you think?

Cheers Lucas

lczech commented 3 months ago

Hi @JennyCNS,

just wanted to give a brief heads-up about the latest release grenedalf v0.5.0. It has several improvements that might be relevant for you. Also, I now wrote some additional documentation about VCF input, see here.

Hope that helps. Going to close the issue now, but feel free to re-open or start a new one if you have more questions!

Cheers and so long Lucas