Open adswafford opened 4 years ago
Hi @adswafford Thank you for reporting this bug and providing very detailed information for debugging! I did some tests but could not replicate the error. But from reading my code I can roughly guess the cause: for one query sequence, some subject genomes cannot be assigned to a given rank while others can, causing None values in the result. To solve this, I made a patch in a new branch austin, although I cannot validate that it works. You may give it a try by updating the program with:
pip install -U git+https://github.com/qiyunzhu/woltka.git@austin
Or you can wait a bit and let me try on the input files you provided.
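The failure mode guessed at above can be reproduced in isolation: Python 3 refuses to order None against str, so sorting a mapping whose keys mix the two raises a TypeError. The sketch below is not Woltka's code. The names `taxa` and `none_safe_key` are hypothetical, the counts are made up, and placing unassigned entries last is just one possible choice.

```python
# Hypothetical assignment counts; None stands for "could not be assigned
# at this rank". The values are made up for illustration.
taxa = {"Escherichia": 3, None: 1, "Bacillus": 2}

# Plain sorting fails because None and str cannot be ordered.
try:
    sorted(taxa.items())
    failed = False
except TypeError:
    failed = True


def none_safe_key(item):
    """Sort key that places a None taxon after every string taxon."""
    taxon, _count = item
    return (taxon is None, taxon or "")


ordered = sorted(taxa.items(), key=none_safe_key)
print(failed)
print(ordered)
```

Whether unassigned reads should be written out at all or dropped before sorting is a separate design decision; the key function only makes the ordering well defined.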
Thanks! I just tried out the patch and got a different error:
(woltk) [adswafford@barnacle.ucsd.edu /projects/cmi_proj/blood_microbiome/niaid 01:26 PM]$ cat ~/sam_test.e1316955
Traceback (most recent call last):
File "/home/adswafford/miniconda3/envs/woltk/bin/woltka", line 8, in
Let me know if you want me to move the alignment files to a directory where you have access?
Progress after the second patch, but a new error:
(woltk) [adswafford@barnacle.ucsd.edu /projects/cmi_proj/blood_microbiome/niaid 01:37 PM]$ cat ~/sam_test.e1316957
Traceback (most recent call last):
File "/home/adswafford/miniconda3/envs/woltk/bin/woltka", line 8, in
7 directories, 101 files
(woltk) [adswafford@barnacle.ucsd.edu /projects/cmi_proj/blood_microbiome/niaid 01:50 PM]$ cat combined/shogun/woltka/output_taxonomy.log
Input directory: /projects/cmi_proj/blood_microbiome/niaid/combined/shogun/wol_alignments.
Number of alignment files to read: 20.
Number of alignment files to read: 20.
Demultiplexing: off.
Constructing classification system...
Parsing taxonomy names file: /projects/cmi_proj/blood_microbiome/three_studies/taxonomy/names.dmp... Done.
Parsing taxonomy nodes file: /projects/cmi_proj/blood_microbiome/three_studies/taxonomy/nodes.dmp... Done.
Parsing simple map file: /projects/cmi_proj/blood_microbiome/three_studies/taxonomy/g2tid.txt... Done.
Classification system constructed.
Total number of classification units: 1669744.
Classification will operate on these ranks: phylum, genus, species, free, none.
Read-to-feature maps will be saved to: /projects/cmi_proj/blood_microbiome/niaid/combined/shogun/woltka/mapdir.
Parsing alignment file CART001_Day_14-DNA_bowtie2_wol_alignment.sam .. Done. Number of query sequences: 116002.
Parsing alignment file CART001_Day_14-RNA_bowtie2_wol_alignment.sam . Done. Number of query sequences: 89445.
Parsing alignment file CART001_Day_30-DNA_bowtie2_wol_alignment.sam . Done. Number of query sequences: 48669.
Parsing alignment file CART001_Day_30-RNA_bowtie2_wol_alignment.sam . Done. Number of query sequences: 102153.
Parsing alignment file CART001_Day_60-DNA_bowtie2_wol_alignment.sam .. Done. Number of query sequences: 131112.
Parsing alignment file CART001_Day_60-RNA_bowtie2_wol_alignment.sam . Done. Number of query sequences: 25457.
Parsing alignment file CART001_Day_7-DNA_bowtie2_wol_alignment.sam . Done. Number of query sequences: 56918.
Parsing alignment file CART001_Day_7-RNA_bowtie2_wol_alignment.sam . Done. Number of query sequences: 73395.
Parsing alignment file CART001_Day_90-DNA_bowtie2_wol_alignment.sam . Done. Number of query sequences: 53435.
Parsing alignment file CART001_Day_90-RNA_bowtie2_wol_alignment.sam . Done. Number of query sequences: 31346.
Parsing alignment file Control01-DNA_bowtie2_wol_alignment.sam .. Done. Number of query sequences: 279521.
Parsing alignment file Control02-DNA_bowtie2_wol_alignment.sam . Done. Number of query sequences: 116450.
Parsing alignment file Control02-RNA_bowtie2_wol_alignment.sam .. Done. Number of query sequences: 269027.
Parsing alignment file Control03-RNA_bowtie2_wol_alignment.sam .. Done. Number of query sequences: 302092.
Parsing alignment file Control04-DNA_bowtie2_wol_alignment.sam .. Done. Number of query sequences: 231287.
Parsing alignment file Control04-RNA_bowtie2_wol_alignment.sam .. Done. Number of query sequences: 252036.
Parsing alignment file Control05-DNA_bowtie2_wol_alignment.sam .. Done. Number of query sequences: 306100.
Parsing alignment file Control05-RNA_bowtie2_wol_alignment.sam .. Done. Number of query sequences: 439871.
Parsing alignment file Control10-DNA_bowtie2_wol_alignment.sam . Done. Number of query sequences: 85534.
Parsing alignment file Control13-RNA_bowtie2_wol_alignment.sam . Done. Number of query sequences: 142018.
Task completed.
Format of output feature table(s): BIOM.
Hi @adswafford Sorry to keep you waiting. I am getting back to the program. Looks like the program managed to progress to the second-to-last step! The error message "Duplicate observation IDs" suggests that there are some duplicate taxon names. I guess this is because some taxon names in the NCBI taxdump are duplicated.
To resolve this, you may remove --name-as-id from the command line, so that the observation IDs in the BIOM table will still be taxon IDs, which are unique. Meanwhile, an extra metadata column will be appended to the table, listing the corresponding names, which can have duplicates.
Alternatively, you can add --to-tsv to the woltka command, and the output files will be in plain tab-delimited format, in which duplicate row headers are tolerated. This is not recommended; it is just for a quick check.
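To make the difference between the two labeling schemes concrete, here is a minimal sketch in plain Python. It does not use Woltka or the BIOM library. The two taxon IDs and the shared name come from this thread; the counts and all variable names are made up for illustration.

```python
from collections import Counter

# Taxon IDs and names as mentioned in this thread; counts are invented.
names = {"201174": "Actinobacteria", "1760": "Actinobacteria"}
counts_by_id = {"201174": 10, "1760": 4}

# Names as row labels: two distinct taxa end up with the same label,
# which is the situation a "duplicate observation IDs" check rejects.
labels = [names[tid] for tid in counts_by_id]
duplicated = [name for name, n in Counter(labels).items() if n > 1]

# IDs as row labels, with the name carried as an extra column: every
# row label stays unique even though the names repeat.
rows = [(tid, count, names[tid]) for tid, count in counts_by_id.items()]

print(duplicated)
print(rows)
```

Keeping the ID as the key and the name as metadata preserves both rows; collapsing by name would either fail or silently merge a phylum with a class.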
As far as I know, the only case in NCBI taxonomy where two names are identical is phylum Actinobacteria (201174) and class Actinobacteria (1760). This is quite unfortunate. The case in which Woltka could run into this error is the free rank classification, where a sequence can be assigned to any rank. If you remove free from the ranks, I guess Woltka will work as well.
I will work on the code to fix this issue as well as the other issue you reported later today.
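Whether Actinobacteria is really the only collision can be checked against the taxdump itself. The sketch below is an independent helper, not part of Woltka. It assumes the standard names.dmp layout (fields separated by tab, pipe, tab, with a trailing tab and pipe), and the function name and sample lines are hypothetical.

```python
from collections import defaultdict


def duplicated_scientific_names(lines):
    """Map each scientific name shared by several taxon IDs to those IDs."""
    ids_by_name = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Drop the trailing "\t|" terminator before splitting the fields.
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        tax_id, name, _unique_name, name_class = fields[:4]
        if name_class == "scientific name":
            ids_by_name[name].append(tax_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Made-up lines in names.dmp layout (the unique-name field is left empty).
sample = [
    "201174\t|\tActinobacteria\t|\t\t|\tscientific name\t|\n",
    "1760\t|\tActinobacteria\t|\t\t|\tscientific name\t|\n",
    "1760\t|\tActinomycetes\t|\t\t|\tsynonym\t|\n",
    "2\t|\tBacteria\t|\t\t|\tscientific name\t|\n",
]
result = duplicated_scientific_names(sample)
print(result)
```

On a real taxdump the same function can be fed an open file handle, for example `with open("names.dmp") as fh: duplicated_scientific_names(fh)`.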
Got it, thanks for the suggestions, explanations, and investigations. I dropped free and it seems to be running now, and I'll let you know if it hits another snag. Thanks!
On Fri, Apr 10, 2020 at 9:42 AM Qiyun Zhu wrote: [quoted text of https://github.com/qiyunzhu/woltka/issues/34#issuecomment-612113719 trimmed]
New error when running taxonomy
Command:
cd $tmp

in_dir=/projects/cmi_proj/blood_microbiome/niaid/combined
out_root=$in_dir/shogun
align_dir=$out_root/wol_alignments
taxonomy=/projects/cmi_proj/blood_microbiome/three_studies/taxonomy
function=/projects/wol/20170307/release/annotation

# do gotus
echo 'starting woltk'
conda activate woltk

echo 'woltk function for niaid'
# make a directory for the output
out_dir=$out_root/woltka
mkdir -p $out_dir

map_dir=$out_dir/mapdir
func_dir=$out_dir/taxfunc

if [ ! -f $out_dir/niaid.woltk.fin ]
then
    $(which time) woltka classify \
        -i $align_dir \
        --map $taxonomy/g2tid.txt \
        --nodes $taxonomy/nodes.dmp \
        --names $taxonomy/names.dmp \
        --rank phylum,genus,species,free,none \
        --name-as-id \
        --outmap $map_dir \
        -o $out_dir/taxonomy/ > $out_dir/output_taxonomy.log
Output log:
(woltk) [adswafford@barnacle.ucsd.edu /projects/cmi_proj/blood_microbiome/niaid 08:07 AM]$ cat combined/shogun/woltka/output_taxonomy.log
Input directory: /projects/cmi_proj/blood_microbiome/niaid/combined/shogun/wol_alignments.
Number of alignment files to read: 20.
Number of alignment files to read: 20.
Demultiplexing: off.
Constructing classification system...
Parsing taxonomy names file: /projects/cmi_proj/blood_microbiome/three_studies/taxonomy/names.dmp... Done.
Parsing taxonomy nodes file: /projects/cmi_proj/blood_microbiome/three_studies/taxonomy/nodes.dmp... Done.
Parsing simple map file: /projects/cmi_proj/blood_microbiome/three_studies/taxonomy/g2tid.txt... Done.
Classification system constructed.
Total number of classification units: 1669744.
Classification will operate on these ranks: phylum, genus, species, free, none.
Read-to-feature maps will be saved to: /projects/cmi_proj/blood_microbiome/niaid/combined/shogun/woltka/mapdir.
Parsing alignment file CART001_Day_14-DNA_bowtie2_wol_alignment.sam
Error log:
(woltk) [adswafford@barnacle.ucsd.edu /projects/cmi_proj/blood_microbiome/niaid 08:05 AM]$ cat ~/sam_test.e1316922
Traceback (most recent call last):
  File "/home/adswafford/miniconda3/envs/woltk/bin/woltka", line 8, in <module>
    sys.exit(cli())
  File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/cli.py", line 181, in classify
    workflow(**kwargs)
  File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/workflow.py", line 109, in workflow
    data = classify(
  File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/workflow.py", line 243, in classify
    assign_readmap(map_, data, rank, sample, **kwargs)
  File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/workflow.py", line 688, in assign_readmap
    write_readmap(fh, asgmt, namedic)
  File "/home/adswafford/miniconda3/envs/woltk/lib/python3.8/site-packages/woltka/file.py", line 333, in write_readmap
    for taxon, count in sorted(taxa.items(), key=sortkey):
TypeError: '<' not supported between instances of 'NoneType' and 'str'
13.34user 2.32system 0:28.97elapsed 54%CPU (0avgtext+0avgdata 1175712maxresident)k
188560inputs+136outputs (270major+353025minor)pagefaults 0swaps
File structure:
(woltk) [adswafford@barnacle.ucsd.edu /projects/cmi_proj/blood_microbiome/niaid 08:09 AM]$ tree combined/shogun/woltka/
combined/shogun/woltka/
├── mapdir
│   ├── free
│   ├── genus
│   ├── none
│   ├── phylum
│   │   └── CART001_Day_14-DNA_bowtie2_wol_alignment.txt.gz
│   └── species
└── output_taxonomy.log
Upstream files (generated by bowtie2 via SHOGUN):
(woltk) [adswafford@barnacle.ucsd.edu /projects/cmi_proj/blood_microbiome/niaid 08:11 AM]$ ls -halS combined/shogun/wol_alignments/
total 1.1G
-rw-r--r-- 1 adswafford knightlab 778M Apr 8 10:05 Control01-DNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 705M Apr 8 10:42 Control05-RNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 703M Apr 8 11:01 Control05-DNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 554M Apr 8 10:40 Control04-DNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 455M Apr 8 10:08 CART001_Day_14-DNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 440M Apr 8 10:20 Control03-RNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 431M Apr 8 10:17 Control02-RNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 426M Apr 8 10:45 Control04-RNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 418M Apr 8 11:09 CART001_Day_60-DNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 371M Apr 8 10:26 Control02-DNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 354M Apr 8 10:15 CART001_Day_14-RNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 282M Apr 8 10:44 CART001_Day_30-RNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 241M Apr 8 09:52 CART001_Day_90-DNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 191M Apr 8 10:14 Control10-DNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 189M Apr 8 10:18 Control13-RNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 183M Apr 8 11:11 CART001_Day_7-RNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 180M Apr 8 11:13 CART001_Day_7-DNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 164M Apr 8 10:27 CART001_Day_30-DNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 132M Apr 8 10:08 CART001_Day_90-RNA_bowtie2_wol_alignment.sam
-rw-r--r-- 1 adswafford knightlab 81M Apr 8 11:12 CART001_Day_60-RNA_bowtie2_wol_alignment.sam
drwxr-xr-x 2 adswafford knightlab 22 Apr 8 11:12 .
drwxr-xr-x 6 adswafford knightlab 7 Apr 8 11:12 ..