stat-lab / EvalSVcallers

Evaluate the performances (precision and recall) of structural variation (SV) callers
32 stars 13 forks source link

Location of files used to generate simulated data? #3

Closed joshuak94 closed 4 years ago

joshuak94 commented 4 years ago

Hello!

In the supplementary files, there is a snippet of code which shows how simulated reads were generated:

python varsim.py--vc_in_vcf $HOME/tool/VarSim/All.vcf --sv_insert_seq $HOME/tool/VarSim/insert_seq.txt --sv_dgv $HOME/tool/VarSim/GRCh37_hg19_supportingvariants_2013-07-23.txt --reference hs37d5.fa--id Sim-A--read_length 125 --mean_fragment_size 500 --sd_fragment_size 100 --vc_num_snp 2700000 --vc_num_ins 270000 --vc_num_del 270000 --sv_num_ins 3000 --sv_num_del 4000 --sv_num_dup 2000 --sv_num_inv 500 --sv_percent_novel 0.2 --vc_percent_novel 0.01 --vc_min_length_lim 0 --vc_max_length_lim 29 --sv_min_length_lim 30 --sv_max_length_lim 1000000 --nlanes 1 --total_coverage 30 --simulator_executable

I was wondering what the following files were: All.vcf (is this Sim-A.SV.vcf?), insert_seq.txt, and GRCh37_hg19_supportingvariants_2013-07-23.txt

Additionally, is the file in Sim-A.diploid.fa.URL.txt the result of the above varsim.py call? So all I need to do to generate reads from there is to use this command:

$ART_DIR/art_illumina -sam -i Sim-A.diploid.fa -l 125 -f 15 -m 500 -s 100 -o Sim-A_30x -p -1 $ART_DIR/Illumina_profiles/HiSeq2500L125R1.txt -2 $ART_DIR/Illumina_profiles/HiSeq2500L125R2.txt

Thank you!

stat-lab commented 4 years ago

Dear Joshua,

We have used an older version (v.0.4.1.2 or earlier) of VarSim. According to the instruction in README, we downloaded these files (insert_seq.txt and GRCh37_hg19_supportingvariants_2013-07-23.txt) and used them as input for simulated SVs. The resulting simulated diploid human genome is Sim-A.diploid.fa, which is available at the indicated URL. If you use the latest version of VarSim, please follow the latest instruction. You would be able to generate simulated reads using the command you indicated.

Best,

2019/11/11 22:44、Joshua Kim notifications@github.comのメール:

Hello!

In the supplementary files, there is a snippet of code which shows how simulated reads were generated:

python varsim.py--vc_in_vcf $HOME/tool/VarSim/All.vcf --sv_insert_seq $HOME/tool/VarSim/insert_seq.txt --sv_dgv $HOME/tool/VarSim/GRCh37_hg19_supportingvariants_2013-07-23.txt --reference hs37d5.fa--id Sim-A--read_length 125 --mean_fragment_size 500 --sd_fragment_size 100 --vc_num_snp 2700000 --vc_num_ins 270000 --vc_num_del 270000 --sv_num_ins 3000 --sv_num_del 4000 --sv_num_dup 2000 --sv_num_inv 500 --sv_percent_novel 0.2 --vc_percent_novel 0.01 --vc_min_length_lim 0 --vc_max_length_lim 29 --sv_min_length_lim 30 --sv_max_length_lim 1000000 --nlanes 1 --total_coverage 30 --simulator_executable I was wondering what the following files were: All.vcf (is this Sim-A.SV.vcf?), insert_seq.txt, and GRCh37_hg19_supportingvariants_2013-07-23.txt

Additionally, is the file in Sim-A.diploid.fa.URL.txt the result of the above varsim.py call? So all I need to do to generate reads from there is to use this command:

$ART_DIR/art_illumina -sam -i Sim-A.diploid.fa -l 125 -f 15 -m 500 -s 100 -o Sim-A_30x -p -1 $ART_DIR/Illumina_profiles/HiSeq2500L125R1.txt -2 $ART_DIR/Illumina_profiles/HiSeq2500L125R2.txt Thank you!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stat-lab/EvalSVcallers/issues/3?email_source=notifications&email_token=ADIBV3HPTMRDXTERP7FR6VTQTFOVJA5CNFSM4JLVYMOKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HYNM5AA, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADIBV3B6GM6MD5EIAM7PE3TQTFOVJANCNFSM4JLVYMOA.

joshuak94 commented 4 years ago

Thank you for your response!

To clarify, I could just download the Sim-A.diploid.fa file provided at the URL, and then use the ART simulator to generate a bam file from this fasta file?

Can I ask what I would use as my reference file, if I wanted to test my own SV caller with this? I noticed that because this is a diploid file, my hs37d5.fa file does not work as a reference (because the chromosome names do not match). Did you use a special version of hs37d5? Or did you just manually separate the bam file into a maternal and a paternal bam file?

stat-lab commented 4 years ago

Yes, the ART simulator should generate paired-end fastq read files with our provided Sim-A.diploid.fa. The reference hs37d5 we used was a standard one. If your ART simulator does not work, try older versions of ART or remove the decoy sequences from Sim-A.diploid.fa.

2019/11/12 17:35、Joshua Kim notifications@github.comのメール:

Thank you for your response!

To clarify, I could just download the Sim-A.diploid.fa file provided at the URL, and then use the ART simulator to generate a bam file from this fasta file?

Can I ask what I would use as my reference file, if I wanted to test my own SV caller with this? I noticed that because this is a diploid file, my hs37d5.fa file does not work as a reference (because the chromosome names do not match). Did you use a special version of hs37d5? Or did you just manually separate the bam file into a maternal and a paternal bam file?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stat-lab/EvalSVcallers/issues/3?email_source=notifications&email_token=ADIBV3HTZNPZTNISULI5I7LQTJTFZA5CNFSM4JLVYMOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDZOPLI#issuecomment-552789933, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADIBV3BSRDPWAQEARQPCMQLQTJTFZANCNFSM4JLVYMOA.

joshuak94 commented 4 years ago

Alright thank you!

I was able to obtain fq files from the ART simulator. Do you then just use bwa mem to align these to hs37d5? I'm a bit confused because in the command, a sam file was also generated.

stat-lab commented 4 years ago

We have aligned the simulated reads to hs37d5 with bwa mem.

2019/11/12 22:06、Joshua Kim notifications@github.comのメール:

Alright thank you!

I was able to obtain fq files from the ART simulator. Do you then just use bwa mem to align these to hs37d5? I'm a bit confused because in the command, a sam file was also generated.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stat-lab/EvalSVcallers/issues/3?email_source=notifications&email_token=ADIBV3ETE6EKRKSQRJVLUQDQTKS5JA5CNFSM4JLVYMOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOED2FXEA#issuecomment-552885136, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADIBV3GQFK2WH6UWGSQL6GTQTKS5JANCNFSM4JLVYMOA.

joshuak94 commented 4 years ago

Great, thank you!

Final question: I noticed in the Sim-A.SV.vcf file, that some variants have chromosome IDs which start with #. Example:

https://github.com/stat-lab/EvalSVcallers/blob/ffcc9585bc185bbccfe864365b4e80669a0d6e8f/Ref_SV/Sim-A.SV.vcf#L334-L350

What does this mean? Are these variants which were not actually simulated with VarSim?

stat-lab commented 4 years ago

The VarSim-simulated variants with ‘#’ are ones overlapping with the gap regions in the hs37d5 reference. These were annotated by us and excluded from the simulated SV list for the downstream analysis.

2019/11/14 5:45、Joshua Kim notifications@github.comのメール:

Great, thank you!

Final question: I noticed in the Sim-A.SV.vcf file, that some variants have chromosome IDs which start with #. Example:

https://github.com/stat-lab/EvalSVcallers/blob/ffcc9585bc185bbccfe864365b4e80669a0d6e8f/Ref_SV/Sim-A.SV.vcf#L334-L350 https://github.com/stat-lab/EvalSVcallers/blob/ffcc9585bc185bbccfe864365b4e80669a0d6e8f/Ref_SV/Sim-A.SV.vcf#L334-L350 What does this mean? Are these variants which were not actually simulated with VarSim?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stat-lab/EvalSVcallers/issues/3?email_source=notifications&email_token=ADIBV3EP277LZEWWJT4QLW3QTRRPNA5CNFSM4JLVYMOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOED7S4MY#issuecomment-553594419, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADIBV3E6RSIRZK652GWCKT3QTRRPNANCNFSM4JLVYMOA.

joshuak94 commented 4 years ago

Thanks for the help thus far, I have yet another question. I'm looking at long read data from the pacbio1 dataset (SRP043609 (SRR1947646–SRR1950271)). The supplementary files say the fastq files were aligned as such: ngmlr -r hs37d5.fa -q NA78.pacbio.fq -t 10 -o $out_prefix.sam --no-smallinv

But from what I can see, the accession numbers above lead to 7 fastq.gz files, not one file like in the command.

Were all of the fastq files concatenated before aligning?

stat-lab commented 4 years ago

Yes, the fast files were concatenated to be aligned.

2019/11/15 22:30、Joshua Kim notifications@github.comのメール:

Thanks for the help thus far, I have yet another question. I'm looking at long read data from the pacbio1 dataset (SRP043609 (SRR1947646–SRR1950271)). The supplementary files say the fastq files were aligned as such: ngmlr -r hs37d5.fa -q NA78.pacbio.fq -t 10 -o $out_prefix.sam --no-smallinv

But from what I can see, the accession numbers above lead to 7 fastq.gz files, not one file like in the command.

Were all of the fastq files concatenated before aligning?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stat-lab/EvalSVcallers/issues/3?email_source=notifications&email_token=ADIBV3AXDUC6DGLKNNCPVATQT2QADA5CNFSM4JLVYMOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEFNRWQ#issuecomment-554359002, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADIBV3EEKRB6TCA4P6IPRVTQT2QADANCNFSM4JLVYMOA.