liangclab / HERA

Use HERA without bionano #5

Open Axolotl233 opened 4 years ago

Axolotl233 commented 4 years ago

Hi!

Thanks for sharing this wonderful tool. I want to use it to improve my assembly, but I don't have BioNano data. Will HERA still work and produce a final result without it? If so, please tell me which steps should be bypassed; I can run HERA manually.

Thank you~

liangclab commented 4 years ago

Hi, you don't need BioNano data to use HERA. Just skip the steps that use BioNano data.

Axolotl233 commented 4 years ago

Thank you

Chengzhi Liang notifications@github.com wrote on Friday, 29 November 2019 at 8:40 AM:

Hi you don't have to have BioNano to use HERA. Just ignore the steps using BioNano data.


SouthernCD commented 4 years ago

Hi @liangclab, could you provide a guide showing how to skip the steps that use BioNano data?

liangclab commented 4 years ago

HERA can run normally without BioNano data. Please make sure you have all the executable files downloaded and put in the right directory to run pipeline.sh, or you can run the Perl scripts manually (make sure you download the newest perl scripts since there are some problems in the old ones).

libenping commented 4 years ago

Hi @liangclab, I have obtained SuperContig.fasta in 10-Contig_Pairs, and its total size is similar to that of Large_Contig.fasta, which was produced by splitting the original sequences. Since I don't have BioNano data, is SuperContig.fasta the final assembly result?

jiandanxie commented 4 years ago

Hi @liangclab, I have the same question. I want to use HERA to fill gaps in my assembly, but SuperContig.fasta does not seem to be what I want.

jiandanxie commented 4 years ago

@libenping Have you solved this problem?

liangclab commented 4 years ago

Have you all read the new README file? We have updated it, and you may find your answers there.

libenping commented 4 years ago

@liangclab @jiandanxie Thank you for the reply. I think I now have the correct answer: SuperContig.fasta and Small_Contig.fasta should be concatenated (with cat). I have another question: before BioNano data is used, HERA only improves the scaffold N50 and the contig N50 doesn't change. Am I right? I have run both my own data and the test data, and both show this behavior.
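The contig N50 versus scaffold N50 comparison described above can be checked with a short script. This is a minimal sketch, not part of HERA: the function names `parse_fasta`, `n50`, and `contig_lengths` are hypothetical, and it assumes that gaps inside a scaffold are written as runs of `N`, so that splitting on those runs recovers the contigs.

```python
import re


def parse_fasta(text):
    """Return a list of (name, sequence) records parsed from FASTA text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def contig_lengths(sequence):
    """Split a scaffold on runs of N (assumed gap character) and return piece lengths."""
    return [len(piece) for piece in re.split(r"[Nn]+", sequence) if piece]


# Toy stand-in for the concatenation of SuperContig.fasta and Small_Contig.fasta.
merged = parse_fasta(">super1\nAAAANN\nCCCC\n>small1\nGG\n")
scaffold_n50 = n50([len(seq) for _, seq in merged])
contig_n50 = n50([n for _, seq in merged for n in contig_lengths(seq)])
print("scaffold N50:", scaffold_n50, "contig N50:", contig_n50)
```

On real data, the text passed to `parse_fasta` would be the contents of the two FASTA files read one after the other, which is equivalent to the `cat` step mentioned in the comment.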

acalcino commented 4 years ago

Hi @liangclab. The workflow for running an assembly without BioNano data is still very unclear to me. The only guidance in the README and in pipeline.sh is that the $Enzyme variable should be 'neglected' if you don't have BioNano data. The problem is that if it is left blank or commented out, the script doesn't work, because it is required as input for later scripts. It's also confusing because the variables $Bionano_Scaffolded_Contig and $Bionano_NonScaffolded_Contig are actually just contigs from the input assembly that have been filtered by size. If I am correct about this, then I don't know how this relates to BioNano contact maps.

Working on the assumption that changing the Enzyme sequence is all that is needed to run this pipeline without BioNano data, I tried changing the input Enzyme sequence from the default GCTCTTC to NNNNNNN and to XXXXXXX for the test DJ datasets. With the default sequence I got a final assembly of 6021253 nt. With the N and X sequences I got 6020982 nt and 6021244 nt, respectively. As a side note, I modified the pipeline to work with the Slurm workload manager and to use minimap2 instead of bwa mem, so my results are slightly different from those of the default options. Either way, how are these slight changes in assembly length explained by the changes to the Enzyme sequence?
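One way to reason about the Enzyme substitution is to count how often each candidate sequence occurs in a contig. The sketch below is only an illustration: this thread does not say how HERA consumes $Enzyme, so plain literal matching on both strands is an assumption, and `count_sites` and `reverse_complement` are hypothetical helpers rather than HERA functions.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string; non-ACGT characters are left unchanged."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def count_sites(sequence, motif):
    """Count literal occurrences of motif on both strands (overlaps allowed).

    ASSUMPTION: literal string matching; HERA's Perl scripts may behave differently.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    # A set avoids double counting when the motif equals its reverse complement.
    patterns = {motif, reverse_complement(motif)}
    count = 0
    for pattern in patterns:
        start = sequence.find(pattern)
        while start != -1:
            count += 1
            start = sequence.find(pattern, start + 1)
    return count


toy_contig = "AAGCTCTTCAAGAAGAGCAA"  # made-up sequence, not from the test data
for enzyme in ("GCTCTTC", "NNNNNNN", "XXXXXXX"):
    print(enzyme, count_sites(toy_contig, enzyme))
```

Under this assumption, a placeholder made of Xs never matches a nucleotide sequence, while a placeholder made of Ns would match any gap run of seven or more Ns in a scaffolded input, so the two placeholders are not necessarily equivalent.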

liangclab commented 4 years ago

The different genome sizes might be caused by different reads being used on the paths (randomly selected). This happens when the nucleotide accuracy of reads is low and can be fixed by sequence correction.

acalcino commented 4 years ago

OK, thanks for that answer, but can you please let us know very specifically how to run the pipeline without BioNano data? It seems that the only inputs to the test pipeline are the Test_Genome.fasta and Test_CorrectedPacbio.fasta files. There doesn't seem to be any way to enter BioNano output files (CMAP, COORD, XMAP, SMAP, BNX) into pipeline.sh, and the $Bionano_Scaffolded_Contig and $Bionano_NonScaffolded_Contig variables are actually just the long and short contigs filtered from the input Test_Genome.fasta file. Here are my questions.

  1. Why are the large and small contig files labelled as $Bionano_Scaffolded_Contig and $Bionano_NonScaffolded_Contig when they are just large and small contigs filtered from the input assembly which isn't necessarily scaffolded by BioNano maps?
  2. Should we replace the Enzyme sequence with Xs or Ns?
  3. If not, what should we do with the $Enzyme variable? What is meant by 'neglect this parameter'?
  4. In the paper you set Lse and Lme to 25kb and 800kb respectively for the rice, maize, human, and Tartary buckwheat genomes. Do Lse and Lme refer to $minimum_len and $large_len variables in the 01-Filter_Raw_Contig_By_Length script?

Thanks very much in advance. I'm really excited about getting this pipeline working on my own data!
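The size-based split described in question 1 can be pictured as follows. This is a hedged sketch, not the 01-Filter_Raw_Contig_By_Length script: the function `partition_by_length` is hypothetical, and the meaning given to the two thresholds (contigs shorter than `minimum_len` are set aside, contigs of at least `large_len` are "large", the rest are "small") is an assumption that the maintainers have not confirmed in this thread.

```python
def partition_by_length(records, minimum_len, large_len):
    """Split (name, sequence) records into large, small, and dropped lists.

    ASSUMPTION: threshold semantics are guessed for illustration only.
    """
    if minimum_len > large_len:
        raise ValueError("minimum_len must not exceed large_len")
    large, small, dropped = [], [], []
    for name, seq in records:
        if len(seq) >= large_len:
            large.append((name, seq))
        elif len(seq) >= minimum_len:
            small.append((name, seq))
        else:
            dropped.append((name, seq))
    return large, small, dropped


# Toy records with tiny thresholds; the paper's values quoted above are 25 kb and 800 kb.
toy_records = [("c1", "A" * 12), ("c2", "C" * 5), ("c3", "G" * 2)]
large, small, dropped = partition_by_length(toy_records, minimum_len=3, large_len=10)
print([name for name, _ in large], [name for name, _ in small], [name for name, _ in dropped])
```

If Lse and Lme do correspond to $minimum_len and $large_len, the call for the genomes named in the paper would use `minimum_len=25_000` and `large_len=800_000`, but that mapping is exactly what question 4 asks.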

sunnycqcn commented 4 years ago

Hi, did you solve this? I have the same question. Thanks, Fuyou

shimiao12345 commented 4 years ago

Hi, I have the same questions. Did you resolve them? Thanks, Miao Shi

acalcino commented 4 years ago

Hello @shimiao12345 @sunnycqcn. No response yet from @liangclab, but hopefully he can respond soon. I am thinking of forking this repository to show my edits that make HERA work with minimap2 and on a Slurm system, but until these issues regarding BioNano data are resolved, I can't really make any more progress.

shimiao12345 commented 4 years ago

Thanks for your reply. Let us wait for @liangclab's response.