Closed cheninouc closed 1 year ago
Dear @cheninouc
Thanks for pointing this out. This is indeed unsafe coding.
The propose of line52 (and the following if
sentence) is to check wether the ${pre}.rmNPlant.ID
is already correctly generated. If it was not pre-generated or not correctly generated, the num_rmNPlant
would all be 0
. So the code did what it meant to do.
However, if the ${pre}.rmNPlant.ID
file was not exist before, there would be a No such file or directory
system error, which may introduce unpredictable behaviors in the future (e.g., if the script is executed with errexit
on). So I have modify line 52 to tell if the file exists first (see modified version here).
Thanks again for your suggestion.
Best wishes,
Songtao Gui
Dear @songtaogui :
Thank you for your prompt response. I know the propose of line52 (and the following if sentence) is to check wether the ${pre}.rmNPlant.ID is already correctly generated. Should the command on line 52 be (num_rmNPlant=$ (cat ${pre} .rmNPlant.ID | wc-l) instead of ( num_rmNPlant=$ (cat ${pre}. RmNPlant.ID-only | wc-l)?
Thanks again for your response. Best Regards, Chenguang Chen
Oops, my bad.
You are right, Line 52 should be num_rnNPlant=$(cat ${pre}.rmNPlant.ID | wc -l)
, and I have now changed them accordingly.
Thx a million.
It is my pleasure Your work is worth learning from and I hope we can have more exchanges in the future.
Best Regards, Chenguang Chen
Sure thing, please contact me if there are any question or interest thought, I could also be reached through email (songtaogui@sina.com). Glad to know my work is helpful : )
Dear @songtaogui I'm so sorry to bother you again.
When I tested pan02_blastNT_clean.sh, I think that you removed pollution through the accession. Should blastdbcmd -db nt -entry all -outfmt "%g,%l,%T" be blastdbcmd -db nt -entry all -outfmt "%a,%l,%T" ? In addition, is the 100 bp WGS PE reads suitable for this pipeline?
Looking forward to your reply Best Regards, Chenguang Chen
@cheninouc
Thanks again for pointing out my mistakes, and sorry for the inconvenience I have brought to you with my poorly-written README. It should be %a
instead of %g
, and I have corrected the readme accordingly.
Please also find another repository of mine ZMP_contam_est which estimated potential contaminations of certain species (as input taxid), in that repo, I do use %a
as the identifer.
Sorry again for all the inconvenience I have brought.
All the best,
Songtao Gui
Ops, almost forgot, 100 bp PE reads would be suitable for this blast-based-cleaning pipeline since BLAST is quite sensitive. However, the downstream assembly continuity could not be guaranteed with short reads only, especially when the genome is high repeatitive.
I'm a inexperienced person. Your answer is greatly appreciated.
Best Regards, Chenguang Chen
Dear @songtaogui
I looked at PANZ_individual_pipe.sh and found -f option which can assign other reference level assemblies. You said in your article that the reference level assemblies were split into scaftigs before mapping. Is there any way to achieve this step? In additon, was the scaftigs aligned with the reference genome using BWAMEM to get ordered bam files?
Looking forward to your reply Best Regards, Chenguang Chen
the PANZ_individual_pipe.sh
script wrapped quast5 for whole genome alignments (see pan01_quast_pre-unaln.sh), which used minimap2
as the actual aligner.
The reason that I split reference-level assemblies into scaftigs is that minimap2 aligning between two long genome sequences usually require large RAM (larger than my working resources could afford). If you have enough RAM (maybe 300~400Gb of RAM for alignment between two 2GB genomes, haven't actually tested), or your genome is not that big, you may not split the reference genomes.
If you do want to split into scaftigs, I would recommend tanghaibao's JCVI, a Swiss army knife for genomic operations. You may find the format
submodule of JCVI helpful.
best wishes
I see. Thank you very much for your patient answer.
best wishes
@cheninouc Sorry for the late reply, tough week. the popins pipe was used to get the split-reads and/or read-pairs supports of NGS short reads for the break ends of the NRSs. For WGA data, the NRSs was assumed to be credible and tagged as "REF".
Also, the pan02_blastNT_clean.sh may not be necessary for WGA based NRSs.
Please also refer to panz_anchor_ref.sh
Best,
sontao gui
I am very glad to receive your reply. I was careless and didn't see the script panz_anchor_ref.sh before.
Best Regards, Chenguang Chen
Dear @songtaogui
When I tested pan10_bed_cls.sh,I found a possible problem with lines 245 and 246 of this script --(cat .clstr | pigz -p $threads > ${workdir}/${pre}_05_All_cdhit_clstr.tsv.gz && cat .fa | pigz -p $threads > ${workdir}/${pre}_05_All_cdhit_sequences.fa.gz ) I think these two lines of command should be (cat split_mttfacdhit_out/.tsv | pigz -p $threads > ${workdir}/${pre}_05_All_cdhit_clstr.tsv.gz && cat split_mttfacdhit_out/.fa | pigz -p $threads > ${workdir}/${pre}_05_All_cdhit_sequences.fa.gz) First, maybe you forgot to add the path. In addtion, the file (cat plit_mttfacdhit_out/.clstr | pigz -p $threads > ${workdir}/${pre}_05_All_cdhit_clstr.tsv.gz) cannot be analyzed downstream. However, replacing .clstr with .tsv could get ref record ID of each clstr.
Looking forward to your reply Best Regards, Chenguang Chen
Dear @songtaogui : I found a question in this bash file. ${pre}.rmNPlant.ID-only is generated in line 55(cut -f 1 ${pre}.rmNPlant.ID > ${pre}.rmNPlant.ID-only) How does this assign a value to the variable num_rmNPlant(the line 52: num_rmNPlant=$(cat ${pre}.rmNPlant.ID-only | wc -l))?