songtaogui / pan-Zea_construct

Workflow to construct linear representation of pan genome from deep WGS data and public assemblies
MIT License
27 stars 5 forks source link

the problem of pan-Zea_construct/src/pan02_blastNT_clean.sh #1

Closed cheninouc closed 1 year ago

cheninouc commented 1 year ago

Dear @songtaogui : I found a question in this bash file. ${pre}.rmNPlant.ID-only is generated in line 55(cut -f 1 ${pre}.rmNPlant.ID > ${pre}.rmNPlant.ID-only) How does this assign a value to the variable num_rmNPlant(the line 52: num_rmNPlant=$(cat ${pre}.rmNPlant.ID-only | wc -l))?

songtaogui commented 1 year ago

Dear @cheninouc

Thanks for pointing this out. This is indeed unsafe coding. The propose of line52 (and the following if sentence) is to check wether the ${pre}.rmNPlant.ID is already correctly generated. If it was not pre-generated or not correctly generated, the num_rmNPlant would all be 0. So the code did what it meant to do. However, if the ${pre}.rmNPlant.ID file was not exist before, there would be a No such file or directory system error, which may introduce unpredictable behaviors in the future (e.g., if the script is executed with errexit on). So I have modify line 52 to tell if the file exists first (see modified version here).

Thanks again for your suggestion.

Best wishes,

Songtao Gui

cheninouc commented 1 year ago

Dear @songtaogui :

Thank you for your prompt response. I know the propose of line52 (and the following if sentence) is to check wether the ${pre}.rmNPlant.ID is already correctly generated. Should the command on line 52 be (num_rmNPlant=$ (cat ${pre} .rmNPlant.ID | wc-l) instead of ( num_rmNPlant=$ (cat ${pre}. RmNPlant.ID-only | wc-l)?

Thanks again for your response. Best Regards, Chenguang Chen

songtaogui commented 1 year ago

Oops, my bad.

You are right, Line 52 should be num_rnNPlant=$(cat ${pre}.rmNPlant.ID | wc -l), and I have now changed them accordingly.

Thx a million.

cheninouc commented 1 year ago

It is my pleasure Your work is worth learning from and I hope we can have more exchanges in the future.

Best Regards, Chenguang Chen

songtaogui commented 1 year ago

Sure thing, please contact me if there are any question or interest thought, I could also be reached through email (songtaogui@sina.com). Glad to know my work is helpful : )

cheninouc commented 1 year ago

Dear @songtaogui I'm so sorry to bother you again.

When I tested pan02_blastNT_clean.sh, I think that you removed pollution through the accession. Should blastdbcmd -db nt -entry all -outfmt "%g,%l,%T" be blastdbcmd -db nt -entry all -outfmt "%a,%l,%T" ? In addition, is the 100 bp WGS PE reads suitable for this pipeline?

Looking forward to your reply Best Regards, Chenguang Chen

songtaogui commented 1 year ago

@cheninouc Thanks again for pointing out my mistakes, and sorry for the inconvenience I have brought to you with my poorly-written README. It should be %a instead of %g, and I have corrected the readme accordingly.

Please also find another repository of mine ZMP_contam_est which estimated potential contaminations of certain species (as input taxid), in that repo, I do use %a as the identifer.

Sorry again for all the inconvenience I have brought.

All the best,

Songtao Gui

songtaogui commented 1 year ago

Ops, almost forgot, 100 bp PE reads would be suitable for this blast-based-cleaning pipeline since BLAST is quite sensitive. However, the downstream assembly continuity could not be guaranteed with short reads only, especially when the genome is high repeatitive.

cheninouc commented 1 year ago

I'm a inexperienced person. Your answer is greatly appreciated.

Best Regards, Chenguang Chen

cheninouc commented 1 year ago

Dear @songtaogui

I looked at PANZ_individual_pipe.sh and found -f option which can assign other reference level assemblies. You said in your article that the reference level assemblies were split into scaftigs before mapping. Is there any way to achieve this step? In additon, was the scaftigs aligned with the reference genome using BWAMEM to get ordered bam files?

Looking forward to your reply Best Regards, Chenguang Chen

songtaogui commented 1 year ago

the PANZ_individual_pipe.sh script wrapped quast5 for whole genome alignments (see pan01_quast_pre-unaln.sh), which used minimap2 as the actual aligner.

The reason that I split reference-level assemblies into scaftigs is that minimap2 aligning between two long genome sequences usually require large RAM (larger than my working resources could afford). If you have enough RAM (maybe 300~400Gb of RAM for alignment between two 2GB genomes, haven't actually tested), or your genome is not that big, you may not split the reference genomes.

If you do want to split into scaftigs, I would recommend tanghaibao's JCVI, a Swiss army knife for genomic operations. You may find the format submodule of JCVI helpful.

best wishes

cheninouc commented 1 year ago

I see. Thank you very much for your patient answer.

best wishes

songtaogui commented 1 year ago

@cheninouc Sorry for the late reply, tough week. the popins pipe was used to get the split-reads and/or read-pairs supports of NGS short reads for the break ends of the NRSs. For WGA data, the NRSs was assumed to be credible and tagged as "REF".

Also, the pan02_blastNT_clean.sh may not be necessary for WGA based NRSs.

Please also refer to panz_anchor_ref.sh

Best,

sontao gui

cheninouc commented 1 year ago

I am very glad to receive your reply. I was careless and didn't see the script panz_anchor_ref.sh before.

Best Regards, Chenguang Chen

cheninouc commented 1 year ago

Dear @songtaogui

When I tested pan10_bed_cls.sh,I found a possible problem with lines 245 and 246 of this script --(cat .clstr | pigz -p $threads > ${workdir}/${pre}_05_All_cdhit_clstr.tsv.gz && cat .fa | pigz -p $threads > ${workdir}/${pre}_05_All_cdhit_sequences.fa.gz ) I think these two lines of command should be (cat split_mttfacdhit_out/.tsv | pigz -p $threads > ${workdir}/${pre}_05_All_cdhit_clstr.tsv.gz && cat split_mttfacdhit_out/.fa | pigz -p $threads > ${workdir}/${pre}_05_All_cdhit_sequences.fa.gz) First, maybe you forgot to add the path. In addtion, the file (cat plit_mttfacdhit_out/.clstr | pigz -p $threads > ${workdir}/${pre}_05_All_cdhit_clstr.tsv.gz) cannot be analyzed downstream. However, replacing .clstr with .tsv could get ref record ID of each clstr.

Looking forward to your reply Best Regards, Chenguang Chen