milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com

Error running importFromIMGT #66

Closed peterch405 closed 8 years ago

peterch405 commented 8 years ago

$ mixcr importFromIMGT
Starting importFromIMGT.sh script
By using this script you agree to the terms of use of IMGT website. (see http://www.imgt.org/ for details).
Press ENTER to continue or other key to exit...
Available species:
(0) Bos taurus
(1) Camelus dromedarius
(2) Canis lupus familiaris
(3) Cercocebus atys
(4) Danio rerio
(5) Homo sapiens
(6) Macaca fascicularis
(7) Macaca mulatta
(8) Macaca nemestrina
(9) Mus
(10) Mus cookii
(11) Mus minutoides
(12) Mus musculus
(13) Mus pahari
(14) Mus saxicola
(15) Mus spretus
(16) Oncorhynchus mykiss
(17) Ornithorhynchus anatinus
(18) Oryctolagus cuniculus
(19) Papio anubis anubis
(20) Rattus norvegicus
(21) Rattus rattus
(22) Sus scrofa
(23) Vicugna pacos
Please select species (e.g. '5' for Homo sapiens): 5
You selected: Homo sapiens.
Please enter a list of common species names for Homo sapiens delimited by ':' to be used in -s option in 'mixcr align ...' (e.g. 'hsa:hs:homosapiens:human'): 'hsa:hs:homosapiens:human'
Getting taxonId for Homo sapiens from NCBI...
Unknown option --xpath
OK. TaxonId=Usage : xmllint [options] XMLfiles ...
Parse the XML files and output the result of the parsing
--version : display the version of the XML library used
--debug : dump a debug tree of the in-memory document
--shell : run a navigating shell
--debugent : debug the entities defined in the document
--copy : used to test the internal copy implementation
--recover : output what was parsable on broken XML documents
--huge : remove any internal arbitrary parser limits
--noent : substitute entity references by their value
--noout : don't output the result tree
--path 'paths': provide a set of paths for resources
--load-trace : print trace of all external entites loaded
--nonet : refuse to fetch DTDs or entities over network
--nocompact : do not generate compact text nodes
--htmlout : output results as HTML
--nowrap : do not put HTML doc wrapper
--valid : validate the document in addition to std well-formed check
--postvalid : do a posteriori validation, i.e after parsing
--dtdvalid URL : do a posteriori validation against a given DTD
--dtdvalidfpi FPI : same but name the DTD with a Public Identifier
--timing : print some timings
--output file or -o file: save to a given file
--repeat : repeat 100 times, for timing or profiling
--insert : ad-hoc test for valid insertions
--compress : turn on gzip compression of output
--html : use the HTML parser
--xmlout : force to use the XML serializer when using --html
--push : use the push mode of the parser
--memory : parse from memory
--maxmem nbbytes : limits memory allocation to nbbytes bytes
--nowarning : do not emit warnings from parser/validator
--noblanks : drop (ignorable?) blanks spaces
--nocdata : replace cdata section with text nodes
--format : reformat/reindent the input
--encode encoding : output in the given encoding
--dropdtd : remove the DOCTYPE of the input docs
--c14n : save in W3C canonical format v1.0 (with comments)
--c14n11 : save in W3C canonical format v1.1 (with comments)
--exc-c14n : save in W3C exclusive canonical format (with comments)
--nsclean : remove redundant namespace declarations
--testIO : test user I/O support
--catalogs : use SGML catalogs from $SGML_CATALOG_FILES otherwise XML Catalogs starting from file:///etc/xml/catalog are activated by default
--nocatalogs: deactivate all catalogs
--auto : generate a small doc on the fly
--xinclude : do XInclude processing
--noxincludenode : same but do not generate XInclude nodes
--nofixup-base-uris : do not fixup xml:base uris
--loaddtd : fetch external DTD
--dtdattr : loaddtd + populate the tree with inherited attributes
--stream : use the streaming interface to process very large files
--walker : create a reader and walk though the resulting doc
--pattern pattern_value : test the pattern support
--chkregister : verify the node registration code
--relaxng schema : do RelaxNG validation against the schema
--schema schema : do validation against the WXS schema
--schematron schema : do validation against a schematron
--sax1: use the old SAX1 interfaces for processing
--sax: do not build a tree but work just at the SAX level
--oldxml10: use XML-1.0 parsing rules before the 5th edition

Libxml project home page: http://xmlsoft.org/
To report bugs or get some help check: http://xmlsoft.org/bugs.html
Creating directory for downloaded files (./imgt_downloads/)
Downloading files:
./imgt_downloads/Homo_sapiens_IGHV.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_IGHD.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_IGHJ.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_IGKV.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_IGKJ.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_IGLV.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_IGLJ.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_TRAV.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_TRAJ.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_TRBV.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_TRBD.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_TRBJ.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_TRDV.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_TRDD.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_TRDJ.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_TRGV.fasta successfully downloaded.
./imgt_downloads/Homo_sapiens_TRGJ.fasta successfully downloaded.
Importing loci:
IGH Error: Was passed main parameter ':' but no main parameter was defined
IGK Error: Was passed main parameter ':' but no main parameter was defined
IGL Error: Was passed main parameter ':' but no main parameter was defined
TRA Error: Was passed main parameter ':' but no main parameter was defined
TRB Error: Was passed main parameter ':' but no main parameter was defined
TRG Error: Was passed main parameter ':' but no main parameter was defined
TRD Error: Was passed main parameter ':' but no main parameter was defined

To use imported segments invoke mixcr with the following parameters: mixcr align --library local -s 'hsa ...

dbolotin commented 8 years ago

Thanks for reporting this issue! It seems that you are using an old version of libxml2: your xmllint does not recognize the --xpath option. Try updating it.
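The failure in the log can be detected before the import runs. The following is a minimal Python sketch, not part of MiXCR or of importFromIMGT.sh: the function names are hypothetical, and it assumes that running `xmllint` with no arguments prints its usage text, as the log above suggests. It checks whether the usage text lists `--xpath`, and whether a captured TaxonId is numeric (the first log shows `TaxonId=Usage : xmllint ...` instead of a number).

```python
import subprocess


def usage_mentions_xpath(usage_text: str) -> bool:
    """Return True if the xmllint usage text lists the --xpath option.

    Lines such as "Unknown option --xpath" are skipped, because they
    report the failure rather than document the option.
    """
    for line in usage_text.splitlines():
        if line.lstrip().startswith("Unknown option"):
            continue
        if "--xpath" in line:
            return True
    return False


def looks_like_taxon_id(value: str) -> bool:
    """Return True if the value is a plain non-negative integer, e.g. '10090'."""
    return value.strip().isdigit()


def installed_xmllint_usage(executable: str = "xmllint") -> str:
    """Collect whatever xmllint prints when run without arguments.

    Assumption: xmllint prints its usage text in that case. Both output
    streams are captured because the stream used may differ by version.
    """
    result = subprocess.run(
        [executable], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr
```

Calling `usage_mentions_xpath(installed_xmllint_usage())` on the affected machine would be one way to confirm that the installed xmllint is too old before starting the import.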

dbolotin commented 8 years ago

The broken step will be removed in the next release (see #67). If you have problems updating the library, or if updating doesn't fix the issue, I will post steps to update the script as soon as it is available.

peterch405 commented 8 years ago

Updated to the latest version of xmllint. TaxonId now works, but there seem to be errors caused by duplicate records in mouse IGK and IGL:

Starting importFromIMGT.sh script
By using this script you agree to the terms of use of IMGT website. (see http://www.imgt.org/ for details).
Press ENTER to continue or other key to exit...
Available species:
(0) Bos taurus
(1) Camelus dromedarius
(2) Canis lupus familiaris
(3) Cercocebus atys
(4) Danio rerio
(5) Homo sapiens
(6) Macaca fascicularis
(7) Macaca mulatta
(8) Macaca nemestrina
(9) Mus
(10) Mus cookii
(11) Mus minutoides
(12) Mus musculus
(13) Mus pahari
(14) Mus saxicola
(15) Mus spretus
(16) Oncorhynchus mykiss
(17) Ornithorhynchus anatinus
(18) Oryctolagus cuniculus
(19) Papio anubis anubis
(20) Rattus norvegicus
(21) Rattus rattus
(22) Sus scrofa
(23) Vicugna pacos
Please select species (e.g. '5' for Homo sapiens): 12
You selected: Mus musculus.
Please enter a list of common species names for Mus musculus delimited by ':' to be used in -s option in 'mixcr align ...' (e.g. 'hsa:hs:homosapiens:human'): 'mouse:mm:musmusculus'
Getting taxonId for Mus musculus from NCBI...
xmllint: /lib64/libz.so.1: no version information available (required by /bi/apps/mixcr/libxml2-2.9.3/lib/libxml2.so.2)
OK. TaxonId=10090
Creating directory for downloaded files (./imgt_downloads/)
Downloading files:
./imgt_downloads/Mus_musculus_IGHV.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_IGHD.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_IGHJ.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_IGKV.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_IGKJ.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_IGLV.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_IGLJ.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_TRAV.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_TRAJ.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_TRBV.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_TRBD.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_TRBJ.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_TRDV.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_TRDD.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_TRDJ.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_TRGV.fasta successfully downloaded.
./imgt_downloads/Mus_musculus_TRGJ.fasta successfully downloaded.
Importing loci:
IGH Processing... Writing report. Writing library file. Checking. Segments successfully imported.
Resulting file contains following records:
10090:IGH: 453 records
IGK Exception in thread "main" com.milaboratory.mixcr.reference.builder.FastaLocusBuilderException: Duplicate records for allele IGKV10-94*01
    at com.milaboratory.mixcr.reference.builder.FastaLocusBuilder.errorOrException(FastaLocusBuilder.java:126)
    at com.milaboratory.mixcr.reference.builder.FastaLocusBuilder.importAllelesFromStream(FastaLocusBuilder.java:246)
    at com.milaboratory.mixcr.reference.builder.FastaLocusBuilder.importAllelesFromFile(FastaLocusBuilder.java:151)
    at com.milaboratory.mixcr.cli.ActionImportSegments.go(ActionImportSegments.java:134)
    at com.milaboratory.mitools.cli.JCommanderBasedMain.main(JCommanderBasedMain.java:145)
    at com.milaboratory.mixcr.cli.Main.main(Main.java:64)
IGL Exception in thread "main" com.milaboratory.mixcr.reference.builder.FastaLocusBuilderException: Duplicate records for allele IGLV2*01
    at com.milaboratory.mixcr.reference.builder.FastaLocusBuilder.errorOrException(FastaLocusBuilder.java:126)
    at com.milaboratory.mixcr.reference.builder.FastaLocusBuilder.importAllelesFromStream(FastaLocusBuilder.java:246)
    at com.milaboratory.mixcr.reference.builder.FastaLocusBuilder.importAllelesFromFile(FastaLocusBuilder.java:151)
    at com.milaboratory.mixcr.cli.ActionImportSegments.go(ActionImportSegments.java:134)
    at com.milaboratory.mitools.cli.JCommanderBasedMain.main(JCommanderBasedMain.java:145)
    at com.milaboratory.mixcr.cli.Main.main(Main.java:64)
TRA Processing... Writing report. Writing library file. Checking. Segments successfully imported.
Resulting file contains following records:
10090:TRA: 321 records
10090:IGH: 453 records
TRB Processing... Writing report. Writing library file. Checking. Segments successfully imported.
Resulting file contains following records:
10090:TRA: 321 records
10090:TRB: 73 records
10090:IGH: 453 records
TRG Processing... Writing report. Writing library file. Checking. Segments successfully imported.
Resulting file contains following records:
10090:TRG: 32 records
10090:TRA: 321 records
10090:TRB: 73 records
10090:IGH: 453 records
TRD Processing... Writing report. Writing library file. Checking. Segments successfully imported.
Resulting file contains following records:
10090:TRA: 321 records
10090:TRB: 73 records
10090:TRD: 53 records
10090:IGH: 453 records
10090:TRG: 32 records

To use imported segments invoke mixcr with the following parameters: mixcr align --library local -s 'mouse ...

dbolotin commented 8 years ago

This error is caused by extremely badly formatted data in IMGT: some segments are simply duplicated in those files. The issue is already fixed in the current development branch and will be released in the 1.7.1 bug-fix release, sometime in the middle of next week.
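Until a fixed release is installed, the downloaded files can be inspected for the duplicates that trigger the exception. This is a minimal Python sketch, not MiXCR's own check: the function names are hypothetical, and it assumes IMGT-style FASTA headers in which the allele name is the second `|`-delimited field.

```python
from collections import Counter
from typing import Iterable, List


def allele_name(header: str) -> str:
    """Extract the allele name from a FASTA header line.

    Assumption: headers are '|'-delimited and the allele name is the
    second field. If there is no '|', the whole header is used.
    """
    fields = header.lstrip(">").strip().split("|")
    return fields[1] if len(fields) > 1 else fields[0]


def duplicate_alleles(fasta_lines: Iterable[str]) -> List[str]:
    """Return the sorted allele names that occur in more than one header."""
    counts = Counter(
        allele_name(line) for line in fasta_lines if line.startswith(">")
    )
    return sorted(name for name, count in counts.items() if count > 1)
```

For example, passing an open file handle for `./imgt_downloads/Mus_musculus_IGKV.fasta` to `duplicate_alleles` would list any allele names that appear more than once in that file.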