pdxgx / neoepiscope

predicts neoepitopes from phased somatic mutations detected using tumor/normal DNA-seq data

fail to include germline variant information #7

Open alex84425 opened 5 years ago

alex84425 commented 5 years ago

Hi, I have recently been trying to generate neopeptides with this pipeline but have run into some difficulties.

First, I successfully ran the pipeline with both somatic and germline variants called by VarScan2.

The commands are shown below:

# merge VarScan SNP and indel variants; I am not sure whether merging the indel variants this way is correct
cat ../../varscan_result/varscan_file_somatic_fetch_exon/001.vcf.*.Somatic.hc  |  grep -v ^\#  > 001.vcf.snp_indel.Somatic.hc
cat ../../varscan_result/varscan_file_Germline/001T.*.vcf > 001.vcf.snp_indel.Germline.hc
neoepiscope swap -i 001.vcf.snp_indel.Somatic.hc -o 001.vcf.snp_indel.Somatic.hc.sw
neoepiscope merge -g 001.vcf.snp_indel.Germline.hc  -s 001.vcf.snp_indel.Somatic.hc.sw  -o 001.merge.vcf

# HapCUT2 needs the variants to be sorted
cat 001.merge.vcf   | sort -k1,1V -k2,2n  > 001.merge.sorted.vcf
mv 001.merge.sorted.vcf 001.merge.vcf

# phase variants; I am not sure of the difference between the Illumina read and 10X Genomics modes, but it only runs correctly with the latter
/home/alex2/git_file/HapCUT2/build/extractHAIRS  --bam ../../GATK_Recalibrator/001T.recal.bam  --VCF 001.merge.vcf  --out  001.merge.vcf.unlinked --10X 1 --indels 1
python3.6 /home/alex2/git_file/HapCUT2/utilities/LinkFragments.py  --bam ../../GATK_Recalibrator/001T.recal.bam   --VCF 001.merge.vcf --fragments 001.merge.vcf.unlinked  --out 001.merge.vcf.linked
/home/alex2/git_file/HapCUT2/build/HAPCUT2 --nf 1 --fragments 001.merge.vcf.linked  --VCF 001.merge.vcf  --output 001.merge.vcf.hp
# can not include germline info
neoepiscope prep -v 001.merge.vcf -c 001.merge.vcf.hp -o 001.merge.vcf.adhp
neoepiscope call -b hg19 -c 001.merge.vcf.adhp  -o 001.merge.vcf.out  -p netMHCpan 4 affinity -a HLA-A*24:02,HLA-A*24:02,HLA-B*54:01,HLA-B*40:02,HLA-C*03:04,HLA-C*01:02

However, the run failed with somatic and germline variants called by the GATK pipeline. The "neoepiscope call" step raises an error:

Traceback (most recent call last):
  File "/home/alex2/anaconda3/envs/exome-seq/bin/neoepiscope", line 10, in <module>
    sys.exit(main())
  File "/home/alex2/anaconda3/envs/exome-seq/lib/python3.6/site-packages/neoepiscope/__init__.py", line 765, in main
    protein_fasta=args.fasta,
  File "/home/alex2/anaconda3/envs/exome-seq/lib/python3.6/site-packages/neoepiscope/transcript.py", line 3149, in get_peptides_from_transcripts
    return_protein=True,
  File "/home/alex2/anaconda3/envs/exome-seq/lib/python3.6/site-packages/neoepiscope/transcript.py", line 2360, in neopeptides
    protein = seq_to_peptide(sequence[start_codon[0] :], reverse_strand=False)
  File "/home/alex2/anaconda3/envs/exome-seq/lib/python3.6/site-packages/neoepiscope/transcript.py", line 265, in seq_to_peptide
    codon = _codon_table[seq[i : i + 3]]
KeyError: 'G*A'

It seems that an unexpected character, "*", appears in the sequence, but I have no idea how to solve this. I am willing to share my .bam and VCF files with you.

data link: https://drive.google.com/drive/folders/1O6PdYwImV0fEXDHPOemixV6cbVPhxatL?usp=sharing

maryawood commented 5 years ago

Hello, I'm sorry that you are having trouble! I have requested access to your google drive folder so I can take a look at the data and try to see what the problem is.

alex84425 commented 5 years ago

Hello, I received your access request email and approved it. Please try again.

-- Best,

陸建利, master's student, Institute of Bioinformatics, National Chiao Tung University

Po-Yuan Chen, Master Student, Institute of Bioinformatics and Systems Biology, National Chiao Tung University

maryawood commented 5 years ago

Thank you, I am able to access the folder now! The file names in that directory do not match the ones you use in the commands. Could you tell me which file names in the google drive folder correspond to the file names in your commands?

alex84425 commented 5 years ago

OK, I will rerun the commands with the new file names. Please give me a moment.

alex84425 commented 5 years ago

Well, I think I found the problem after rerunning the commands.

First, I had actually combined somatic SNVs/indels called by VarScan with germline SNVs/indels called by Mutect2, rather than using only the GATK pipeline as I described above. As you know, that produced an error.

The problem is caused by variants called by Mutect2 or HaplotypeCaller. The ALT column of the VCF file can contain the "," symbol, for example "G,GAA". I simply removed the records that contain it and kept testing, and it seems to work.

Sometimes this problem causes an error in the HapCUT2 step or in "neoepiscope call", but not with GATK ReadBackedPhasing.

By the way, the commands I ran are stored in the cmd.sh file.
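
For anyone trying the same workaround, the removal of records with a comma in the ALT column could look roughly like the sketch below. It is not part of neoepiscope and is not the script actually used here; the function name `drop_multiallelic` and the example records are hypothetical, and it assumes a plain tab-separated VCF with ALT in the fifth column.

```python
def drop_multiallelic(vcf_lines):
    """Yield VCF lines, skipping records whose ALT column lists several alleles."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep header lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) is ALT; multiple alleles are comma-separated.
        if len(fields) > 4 and "," in fields[4]:
            continue
        yield line


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t50\tPASS\t.\n",
    "chr1\t200\t.\tGA\tG,GAA\t50\tPASS\t.\n",
]
kept = list(drop_multiallelic(example))
print("".join(kept), end="")
```

Dropping whole records is the bluntest option: it also discards the other alleles at those sites, so it is best treated as a diagnostic step rather than a final filter.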

maryawood commented 5 years ago

Which version of neoepiscope did you use for your analysis? The software should be able to handle cases where there are multiple alternate alleles as you described, but I would like to test out the commands using the same version as you so I can better see what happened.

alex84425 commented 5 years ago

I believe the version is "0.3.5", based on "/home/alex2/anaconda3/envs/exome-seq/lib/python3.6/site-packages/neoepiscope/version.py".

alex84425 commented 5 years ago

So, is there a better way to deal with multiple alternate alleles? It seems that "0.3.5" is the newest version.

maryawood commented 5 years ago

Sorry for the delay, I had not had a chance to work on this yet! I just took a look today, and as suspected I got a similar error whether or not I retained variants with multiple alternate alleles. It appears that the issue is actually variants with '*' as the alternate allele, representing spanning deletions, which neoepiscope does not currently support. Thank you for bringing this to our attention! We will plan to incorporate a fix for this into an upcoming release of neoepiscope to increase the flexibility of the tool.
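
As a rough illustration of the explanation above, the sketch below shows how a codon lookup fails once a '*' ends up in the sequence, and how one might list the VCF records that carry a '*' alternate allele before running the pipeline. The codon table is a tiny hypothetical subset, not neoepiscope's own table, and the function names and example records are invented for this sketch.

```python
# Tiny illustrative subset of a codon table (hypothetical, not neoepiscope's own).
CODON_TABLE = {"ATG": "M", "GGA": "G", "GCA": "A"}


def translate(seq):
    """Translate a nucleotide string codon by codon using CODON_TABLE."""
    return "".join(CODON_TABLE[seq[i : i + 3]] for i in range(0, len(seq) - 2, 3))


def spanning_deletion_records(vcf_lines):
    """Return (chrom, pos) for records whose ALT column includes the '*' allele."""
    hits = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 4 and "*" in fields[4].split(","):
            hits.append((fields[0], int(fields[1])))
    return hits


try:
    translate("ATGG*A")  # '*' written into the sequence as if it were a base
except KeyError as err:
    failed_codon = err.args[0]  # the codon that was missing from the table
    print("lookup failed for codon:", failed_codon)

# Made-up records for illustration only.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t50\tPASS\t.\n",
    "chr1\t300\t.\tC\t*,T\t50\tPASS\t.\n",
]
hits = spanning_deletion_records(example)
print(hits)
```

Listing the affected positions first makes it easier to decide whether to drop those records or to upgrade to a release that supports spanning deletions.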

alex84425 commented 4 years ago

Sorry to bother you again. I am trying to reproduce the results of this paper and have run into some problems. Paper link: https://www.nature.com/articles/nature14426. Before opening an issue: does neoepiscope support mouse models by simply changing the bowtie1 index and the GTF file that the program requires?

maryawood commented 4 years ago

No bother at all! We don't currently have built-in support for genomes other than human hg19/hg38, but it's pretty easy to get things set up on your own to use a different genome/species. Instead of using the --build option when running neoepiscope call, you will use the --dicts and --bowtie-index options with some data you download (and process a bit) yourself.

If you'd like to use a mouse model instead of a human model, you can download the mouse GTF file for your genome build of choice from GENCODE: https://www.gencodegenes.org/mouse/

Using neoepiscope index, you can then index the GTF file to create pickled dictionaries necessary for predicting neoepitopes. This only needs to be done once. Then whenever you run neoepiscope call, you can use the --dicts option on the command line to specify the directory containing those pickled dictionaries.

Additionally, you will need a bowtie index for your mouse genome, which you can download from the bowtie website: ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/

You can use this index with the --bowtie-index option when running neoepiscope call.
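
The two steps above can be sketched as a small Python helper that only assembles the command lines, without running them. The --dicts and --bowtie-index options come from the description above, and -c and -o from the commands earlier in this thread. The -g and -d flags for neoepiscope index are an assumption (check `neoepiscope index --help`), and every path is a hypothetical placeholder.

```python
import shlex


def index_command(gtf_path, dicts_dir):
    """Build the one-time indexing command for a GTF file."""
    # ASSUMPTION: -g (GTF) and -d (output directory) are the flag names
    # for `neoepiscope index`; verify with `neoepiscope index --help`.
    return ["neoepiscope", "index", "-g", gtf_path, "-d", dicts_dir]


def call_command(dicts_dir, bowtie_index, haplotypes, output):
    """Build a `neoepiscope call` command that uses custom dictionaries."""
    return [
        "neoepiscope", "call",
        "--dicts", dicts_dir,            # replaces the --build option
        "--bowtie-index", bowtie_index,  # basename of the bowtie index files
        "-c", haplotypes,                # prepared haplotype file
        "-o", output,
    ]


# Hypothetical placeholder paths.
index_cmd = index_command("path/to/mouse.gtf", "path/to/mouse_dicts")
call_cmd = call_command(
    "path/to/mouse_dicts", "path/to/mouse_bowtie/index_basename",
    "sample.adhp", "sample.out",
)
print(shlex.join(index_cmd))
print(shlex.join(call_cmd))
```

The printed lines can be reviewed and then pasted into a shell; any binding-prediction options (such as -p and -a in the earlier commands) would still need to be appended to the call command.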

Hope this helps!

alex84425 commented 4 years ago

Me again XDD. Your suggestions work well. It seems that the reference files must come from GENCODE rather than another source such as Ensembl. In addition, I recently attended a seminar and found that most speakers and studies include the binding affinity of the corresponding normal epitope in their results and compare it with that of the neoepitope. I think this feature is worth adding; it would only require running the affinity prediction tool a second time.

maryawood commented 4 years ago

Thank you for the suggestion! That is something that we will probably add in the future. Also, the latest release of neoepiscope should be able to handle the spanning deletions you had issues with before, so hopefully that will no longer cause problems for you if you update your installation of neoepiscope!