pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

Questions regarding Query IDs construction #55

Closed yaada100 closed 2 years ago

yaada100 commented 2 years ago

Dear Pauline and Robert,

I hope it is alright to ask these questions on here.

I have a question regarding the output. The code produces a folder called SIFT_predictions containing aligned FASTA files. If I've understood correctly, those sequences are aligned with Smith-Waterman (SW), i.e. pairwise and locally against the query sequence. The identities of the aligned sequences are listed, but what about the query sequence? Only its ID appears in the file name (e.g. ENST000063518).

  1. Can you clarify how those IDs are derived?
  2. The sequences are aligned locally, so is it possible to get the full query sequence from just its "ID"?
  3. The SIFT.prediction files give position-specific SIFT scores for the newly constructed queries. How is this information useful to us? Can it be traced back to the original query? Or is the SIFT4G annotator the only way to get position-specific SIFT scores for the query sequence?

Thanks in advance, Yakup

pauline-ng commented 2 years ago

Hi Yakup,

I'll explain the results in the context of one of the test files: perl make-SIFT-db-all.pl --config test_files/candidatus_carsonella_ruddii_pv_config.txt

Run the above command, and you'll generate a bunch of files. One example file: SIFT_prediction/BAF35125.aligned.fasta

  1. BAF35125 is a transcript ID, obtained from the GTF file in gene-annotation-src/Candidatus_carsonella_ruddii_pv.ASM1036v1.34.gtf.gz
  2. The first sequence in the file SIFT_prediction/BAF35125.aligned.fasta labelled "QUERY" corresponds to the BAF35125 protein sequence. It is the full query sequence.
  3. You can use the SIFT prediction files if you're working in protein coordinates -- just look up the .SIFTprediction file and look up the amino acid substitution for the corresponding protein.

Best, Pauline
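For readers who want to pull the query protein out of one of these files, here is a minimal Python sketch based on point 2 above. It assumes only that the file is ordinary FASTA and that the first record is the one labelled "QUERY". The helper names (read_first_fasta_record, ungapped) and the gap characters "-" and "." are illustrative assumptions, not part of SIFT4G.

```python
import os


def read_first_fasta_record(path):
    """Return (header, sequence) for the first record in a FASTA file.

    Per the thread, the first record of <transcript>.aligned.fasta is
    labelled "QUERY" and holds the full query protein sequence.
    """
    header = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    break  # a second record starts here, so stop reading
                header = line[1:]
            elif header is not None:
                chunks.append(line)
    if header is None:
        raise ValueError(f"no FASTA record found in {path}")
    return header, "".join(chunks)


def ungapped(sequence, gap_chars="-."):
    """Drop alignment gap characters (assumed here to be '-' and '.')."""
    return "".join(c for c in sequence if c not in gap_chars)


if __name__ == "__main__":
    # Example path taken from the thread; adjust to your own output folder.
    example = "SIFT_prediction/BAF35125.aligned.fasta"
    if os.path.exists(example):
        header, seq = read_first_fasta_record(example)
        print(header, len(ungapped(seq)))
```

If the header of the first record is not "QUERY" in your output, it is worth checking the file by hand before relying on this.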

yaada100 commented 2 years ago

Hey Pauline,

First of all, thank you for your fast reply. Regarding points 2 and 3: what I meant by "query" is the original query. In the context of the candidatus example, I meant Candidatus_carsonella_ruddii_pv.ASM1036v1.dna.chromosome.Chromosome.fa as the "Query". Can those results be traced back to this sequence?

Best, Yakup

pauline-ng commented 2 years ago

If we define the genomic sequence as "Query", then go to candidatus_carsonella_ruddii_pv/ASM1036v1.34/Chromosome.gz

In this file:

  Column 1: the position in the query sequence
  Column 2: the reference DNA allele

SIFT information is in other columns.
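A minimal Python sketch for reading such a file, for anyone who wants to look up positions directly. The thread only states that column 1 is the position and column 2 is the reference allele. That the file is gzip-compressed, tab-delimited text, and that lines starting with "#" or with a non-numeric first field are headers, are assumptions made here. The function name iter_positions is hypothetical.

```python
import gzip


def iter_positions(path, delimiter="\t"):
    """Yield (position, ref_allele, other_columns) for each data row.

    Column 1 = position in the genomic sequence, column 2 = reference
    DNA allele (as described in the thread). The remaining columns are
    passed through untouched because their layout is not described here.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # assumed comment/header line
            fields = line.split(delimiter)
            if len(fields) < 2 or not fields[0].isdigit():
                continue  # assumed header or malformed row
            yield int(fields[0]), fields[1], fields[2:]


def rows_at(path, position):
    """Collect every row whose column 1 equals the requested position."""
    return [row for row in iter_positions(path) if row[0] == position]
```

For example, rows_at("candidatus_carsonella_ruddii_pv/ASM1036v1.34/Chromosome.gz", 1234) would return all rows at position 1234, with the SIFT information in the third element of each tuple.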