mskcc / NeoAntigen-Pipeline

Pipeline for computing neoantigen qualities from DNA and RNA-Seq data
MIT License
1 stars 0 forks source link

investigate lens and find out what tools can be integrated into version 2 #22

Open nikhil opened 4 months ago

nikhil commented 4 months ago

https://gitlab.com/landscape-of-effective-neoantigens-software/nextflow/modules/tools/lens/-/wikis/home

nikhil commented 4 months ago

Investigate how it works with indels

nikhil commented 3 months ago

Currently we are looking at these two scripts and seeing what we are doing differently

https://github.com/taylor-lab/neoantigen-dev/blob/master/neoantigen.py

https://github.com/spvensko/lenstools/blob/main/lenstools.py

pintoa1-mskcc commented 3 months ago

Comparing how peptides are generated between lenstools and our neoantigen script, our script is a lot more elegant. Allison had asked me how we generate the peptides compared to lenstools, primarily asking if they translate out the entire peptide or not. Neither script retains the entire peptide for netMHC, both add some padding so additional aas are added to upstream and downstream of the indel.

Lenstools uses this workflow: lenstools.nf (readme, called by: lens.nf-> neos.nf --> seq_variation.nf ) to generate var_tx_seqs. Var_tx_seqs are individual FA files PER MUTATION, generated from bcftools. Command: bcftools consensus -H A --mark-del X --mark-ins lc -f bcftools ref. I tried to find the input FA file, but its nearly impossible with how they buried it. Its in samtools.nf as samtools_faidx_fetch_somatic_folder_parameters, however without knowing the inputs I cannot tell what is what. From demo_inputs I would guess it is just the whole genome FASTA file, not a cds FA file.

Actual generation of peptides is here: make_indel_peptides_context This script pulls the created FA file per variant as the FA sequence (selecting only cds regions). Then depending if the sequence is on the negative strand, it returns the reverse compliment of the sequence. Lenstools adds a 30 bp padding to conservative inframe indels (so 10 extra amino acids are added to both sides of the peptide from the indel, or if a stop codon is reached) and will translate frameshift indels from 30 bases upstream of the indel to the end of the peptide (or if a stop codon is reached). Script deletes any "X" created by bctools consensus and generated peptides off of the resulting sequence with Bio.Seq. This script never looks at the WT peptide sequence, entirely depends on bcftools consensus FA output.

Taylorlab script pulls the cds and cdna sequences from the provided reference files. Then utilizing the maf's HGVSc column, it alters the cds sequence by either inserting the insertion sequence or removing the reference allele for deletions at the position stated. Once the WT and MUT peptide sequences are identified, script manually converts entire sequence to amino acid sequences. Script then compares WT to MUT amino acid sequences, identifying location where AAs do not match. Script will then return the altered amino acid (plus 10 aas before and after altered region).

Result: Lenstools and taylorlab scripts both return peptides with a padding of 10 AAs. Primary difference appears to be flipping the sequence when the strand is negative, however I am unsure if that is due to the reference FASTA that lenstools inputs. It may not be a cds sequence as we are inputting.

pintoa1-mskcc commented 3 months ago

Discussed with mark, FA file is a genomic FASTA file, this is likely being done because prior to conversion of indel, germline variants are integrated. This may be something to look into further when we add germline variants to our pipeline.

Will continue investigating LENStools