simonhmartin / genomics_general

General tools for genomic analyses.
329 stars 92 forks source link

error transcripts codingSiteTypes #112

Open emgiles opened 5 months ago

emgiles commented 5 months ago

Hi I am getting an error when I try to run the codingSiteTypes.py. Can this only take a gff3 that includes one transcript per gene? The help option does not appear to give more information.

Any advice would be much appreciated. THANKS

simonhmartin commented 5 months ago

Hi, I think it can run with a gff that has any number of transcripts, but if different transcripts disagree on the status of a given site, the script will return an error. This can be avoised by including the option --ignoreConflicts

emgiles commented 5 months ago

Even using --ignoreConflicts, I only managed to get it to work with a gff3 including single isoforms. I am wondering if there is a way to retain the comparison for which variants return syn and non? If I have 100+ samples in my vcf, a variant is labeled as non if in at least one individual a non synonymous mutation occurs? What if in other individuals there is a syn mutation at that site?

simonhmartin commented 5 months ago

Oh, interesting. Do you get an error with multiple transcripts per gene?

Regarding sites with both syn and nonsyn variants: These would have to be triallelic, so probably a fairly small fraction of sites. Currently the script would report these as nonsynonymous variants. Specifically, what the script does is just considers the complete set of alleles in the REF and ALT columns of the vcf at each site. Then, given the three lists of alleles at the first, second, and third position in a given codon, it checks how many different amino acids would be possible, by trying all combinations. If more than one amino acid is possible at the the codon, the variable site in the codon is labelled as nonsynonymous. Note that codons where there are multiple variant sites are not evaluated, because this would make the calls of syn/nonsyn conditional on the allele at the other site.

The function doing this part looks like this

def synNon(pos1Alleles,pos2Alleles,pos3Alleles):
    output = ["NA","NA","NA"]
    #first check that each position has at least one allele and at most one position is variable
    nAlleles = [len(alleles) for alleles in (pos1Alleles,pos2Alleles,pos3Alleles,)]
    if not sorted(nAlleles) == [1,1,2]:
        return output
    else:
        #the variable site is the only one that will be labelled as syn/non
        focal = nAlleles.index(2)
        #check the number of possible amino acids given all possible codons
        l = len(possibleAAs(pos1Alleles,pos2Alleles,pos3Alleles))
        if l == 1: output[focal] = "syn"
        elif l > 1: output[focal] = "non"
    return output
emgiles commented 5 months ago

Hi Thanks for the info. The error I get is just this: KeyError: 'g10707.t1' I have tried different annotation files with more or less scaffolds and get the same error, though sometimes with different transcript names. The only gff3 that worked included only longest isoforms and post gxf2gxf (AGAT) run.

Trying to modify script to retain vcf header information and samples involved in comparisons...

simonhmartin commented 5 months ago

This could be solvable. Could you share the whole error message including the line of the script that is causing the problem?

emgiles commented 5 months ago

Hi Here is the full command and error: python genomics_general/codingSiteTypes.py -a /full_path/my_annotation_v1_ch10_top10.gff3 -f gff3 -r /full_path/jasmine_nosemi_10chrm.fasta -v /full_path/reseq_data/my_pileup_05jan2024_called_filtered2_fmiss50_allsites_10chrms_only10_hybridzone.vcf.gz --ignoreConflicts > debug.coding_site_types.tsv Parsing annotation Traceback (most recent call last): File "/full_path/genomics_general/codingSiteTypes.py", line 44, in geneData = genomics.parseGenes(ann.readlines(), fmt=args.format) File "/full_path/genomics_general/genomics.py", line 198, in parseGenes output[scaffold][mRNA]['exons'] += 1 KeyError: 'g10707.t1'

I have not figured out a way to retain the sample ID where syn/ non are identified, but I did try using a vcf with a subset of samples and I find the same number of syn and non sites as with the full dataset. Seems strange to me! I will try again with an even more limited dataset...