peterjc / thapbi-pict

Tree Health and Plant Biosecurity Initiative - Phytophthora ITS1 Classifier Tool
https://thapbi-pict.readthedocs.io/
MIT License
8 stars 2 forks source link

Tests failing on AppVeyor #627

Closed peterjc closed 3 months ago

peterjc commented 3 months ago

They may be failing on CircleCI too, but masked by #626. Tests failing:

+ export DB=/tmp/ncbi_import/multiple_hmm.sqlite
+ DB=/tmp/ncbi_import/multiple_hmm.sqlite
+ rm -rf /tmp/ncbi_import/multiple_hmm.sqlite
+ thapbi_pict load-tax -d /tmp/ncbi_import/multiple_hmm.sqlite -t new_taxdump_2019-09-01
WARNING: Updating existing synonym entry for Phytophthora sp. CAL-2011a
WARNING: Updating existing synonym entry for Phytophthora sp. CAL-2011a
WARNING: Updating existing synonym entry for Phytophthora sp. CAL-2011a
Loaded 78 new genera, and 2925 new species entries with 1232 synonyms into DB (0, 0 and 3 already there)
++ sqlite3 /tmp/ncbi_import/multiple_hmm.sqlite 'SELECT COUNT(id) FROM taxonomy;'
+ '[' 3003 -ne 3003 ']'
+ cutadapt --quiet -g GAAGGTGAAGTCGTAACAAGG tests/ncbi-import/multiple_hmm.fasta
+ cutadapt --quiet -a GYRGGGACGAAAGTCYYTGC -o /tmp/ncbi_import/multiple_hmm.fasta -
+ thapbi_pict import -k ITS1 -l GAAGGTGAAGTCGTAACAAGG -r GCARRGACTTTCGTCCCYRC -c ncbi -d /tmp/ncbi_import/multiple_hmm.sqlite -g -i /tmp/ncbi_import/multiple_hmm.fasta -n 'NCBI examples with multiple HMM matches'
++ sqlite3 /tmp/ncbi_import/multiple_hmm.sqlite 'SELECT COUNT(id) FROM data_source;'
+ '[' 0 -ne 1 ']'
+ echo 'Wrong data_source count'
Wrong data_source count
+ false

No obvious cause in the code itself, perhaps a platform change? Before/after were both using cutadapt 4.8

peterjc commented 3 months ago

Noticed on #625.

peterjc commented 3 months ago

Time frame for an AppVeyor related change in 22 May to 3 June...

There have been updates to cutadapt 4.8 on bioconda, but should by from PyPI via pip - and no changes there since April 2024.

peterjc commented 3 months ago

Looks like the cutadapt call is failing, generated file /tmp/ncbi_import/multiple_hmm.fasta is now empty (should be 5 sequences).

Bad: Successfully installed biopython-1.83 contourpy-1.2.1 cutadapt-4.8 cycler-0.12.1 dnaio-1.2.0 fonttools-4.53.0 greenlet-3.0.3 isal-1.6.1 kiwisolver-1.4.5 matplotlib-3.9.0 networkx-3.3 numpy-1.26.4 pillow-10.3.0 pydot-2.0.0 pyparsing-3.1.2 python-dateutil-2.9.0.post0 rapidfuzz-3.9.3 six-1.16.0 sqlalchemy-2.0.30 typing-extensions-4.12.2 xlsxwriter-3.2.0 xopen-2.0.1 zlib-ng-0.4.3

Good: Successfully installed biopython-1.83 contourpy-1.2.1 cutadapt-4.8 cycler-0.12.1 dnaio-1.2.0 fonttools-4.53.0 greenlet-3.0.3 isal-1.6.1 kiwisolver-1.4.5 matplotlib-3.9.0 networkx-3.3 numpy-1.26.4 pillow-10.3.0 pydot-2.0.0 pyparsing-3.1.2 python-dateutil-2.9.0.post0 rapidfuzz-3.9.3 six-1.16.0 sqlalchemy-2.0.30 typing-extensions-4.12.1 xlsxwriter-3.2.0 xopen-2.0.1 zlib-ng-0.4.3

Surely not due to a point revision of typing-exensions?

peterjc commented 3 months ago

Interesting, fixing that test by pre-generating the FASTA file, it now fails on another cutadapt piping example:

+ rm -rf /tmp/ncbi_import/20th_Century_ITS1.fasta /tmp/ncbi_import/20th_Century_ITS1_Peronosporaceae.fasta
+ cat tests/ncbi-import/20th_Century_ITS1.fasta
+ sed 's/ P\./ Phytophthora /g'
+ cutadapt --quiet -g GAAGGTGAAGTCGTAACAAGG -
+ cutadapt --quiet -a GYRGGGACGAAAGTCYYTGC -o /tmp/ncbi_import/20th_Century_ITS1.fasta -
Format of input file '0' not recognized.
peterjc commented 3 months ago

This reminds me of https://github.com/marcelm/cutadapt/issues/774 which was traced to a problem in xopen 2.0.0 https://github.com/pycompression/xopen/issues/157

But it is different, the AppVeyor failures are using xopen-2.0.1-py3-none-any.whl

I note cutadapt 4.8 does use typing_extensions to define a sequence protocol, https://github.com/marcelm/cutadapt/blob/v4.8/src/cutadapt/qualtrim.pyi

Test case working locally but under Python 3.10 not 3.12:

$ pip install xopen==2.0.1 dnaio==1.2.0 cutadapt==4.8 typing-extensions==4.12.2
$ cutadapt --quiet -g GAAGGTGAAGTCGTAACAAGG tests/ncbi-import/multiple_hmm.fasta \
| cutadapt --quiet -a GYRGGGACGAAAGTCYYTGC -o x.fasta - ; grep "^>" x.fasta -c
5
peterjc commented 3 months ago

Couldn't reproduce this on Linux with Python 3.11.6 either, but given I suspect the typing extensions Python 3.12 might be key here.

peterjc commented 3 months ago

Not typing-extensions 4.12.2, this faild on AppVeyor:

Successfully installed biopython-1.83 contourpy-1.2.1 cutadapt-4.8 cycler-0.12.1 dnaio-1.2.0 fonttools-4.53.0 greenlet-3.0.3 isal-1.6.1 kiwisolver-1.4.5 matplotlib-3.9.0 networkx-3.3 numpy-1.26.4 packaging-24.0 pillow-10.3.0 pydot-2.0.0 pyparsing-3.1.2 python-dateutil-2.9.0.post0 rapidfuzz-3.9.3 six-1.16.0 sqlalchemy-2.0.30 thapbi-pict-1.0.13 typing-extensions-4.12.2 xlsxwriter-3.2.0 xopen-2.0.1 zlib-ng-0.4.3
peterjc commented 3 months ago

Trying locally with Python 3.12.3 to match AppVeyor, and using the current master branch:

$ conda create -y --name py_3_12 python==3.12.3
$ conda activate py_3_12
$ conda install sqlite
$ python -m pip install -U pip setuptools  virtualenv wheel build
...
Successfully installed build-1.2.1 distlib-0.3.8 filelock-3.14.0 packaging-24.1 platformdirs-4.2.2 pyproject_hooks-1.1.0 virtualenv-20.26.2
$ python -m pip install -r requirements.txt
...
Successfully installed biopython-1.83 contourpy-1.2.1 cutadapt-4.8 cycler-0.12.1 dnaio-1.2.0 fonttools-4.53.0 greenlet-3.0.3 isal-1.6.1 kiwisolver-1.4.5 matplotlib-3.9.0 networkx-3.3 numpy-1.26.4 pillow-10.3.0 pydot-2.0.0 pyparsing-3.1.2 python-dateutil-2.9.0.post0 rapidfuzz-3.9.3 six-1.16.0 sqlalchemy-2.0.30 typing-extensions-4.12.2 xlsxwriter-3.2.0 xopen-2.0.1 zlib-ng-0.4.3
$ python -m pip install -e .
...
Successfully installed thapbi_pict-1.0.13
$ tests/test_ncbi-import.sh
...
tests/test_ncbi-import.sh - test_ncbi-import.sh passed

Hmm. Works here too - what is different under AppVeyor?

peterjc commented 3 months ago

I think I need to try and run this locally under Windows...

peterjc commented 3 months ago

Note since then xopen v2.0.2 (2024-06-12) was released but did not makeany differnce to this issue with AppVeyor.

marcelm commented 3 months ago

I ran your tests in a Windows 10 virtual machine that I keep around for these kinds of things, but could not reproduce your problem.

I’m sure I didn’t fully reproduce what AppVeyor does (and my Windows knowledge is too limited), though.

I’m wondering whether this is an encoding issue. This message that Cutadapt prints

Format of input file '0' not recognized.

comes from detect_file_format. In this case, it means that the first character read from stdin was not a >, so somehow the input doesn’t look like a FASTA file. Maybe the file starts with a BOM?

Also, one of the later tests fails like this:

thapbi_pict denoise -i tests/read-correction/AA.before.fasta -o /tmp/sample-tally/after.fasta --denoise unoise-l --minlen 60 -t 0 -α 2.0 -γ 4
usage: thapbi_pict [-h] [-v]
                   {pipeline,load-tax,import,dump,conflicts,prepare-reads,fasta-nr,denoise,sample-tally,classify,assess,summary,edit-graph,ena-submit}
                   ...
thapbi_pict: error: unrecognized arguments: -a 2.0 -? 4

Note the question mark instead of the gamma character in the error message (and alpha became a regular a). This may be independent, but it at least gave me the idea that some encoding thing is going on.

Oh, and maybe as a workaround, you could avoid running two Cutadapt processes connected by a pipe. Instead of trimming the two primers in two steps, you can use a linked adapter. Instead of

cutadapt --quiet -g GAAGGTGAAGTCGTAACAAGG tests/ncbi-import/multiple_hmm.fasta \
| cutadapt --quiet -a GYRGGGACGAAAGTCYYTGC -o x.fasta -

you would use

cutadapt --quiet -a GAAGGTGAAGTCGTAACAAGG...GYRGGGACGAAAGTCYYTGC -o x.fasta tests/ncbi-import/multiple_hmm.fasta

I’ll also make the Format of input file ... not recognized. message a bit more helpful.

peterjc commented 3 months ago

Thank you. Good guess on the BOM, but no:

$ hexdump -C multiple_hmm.fasta | head
00000000  3e 4d 46 33 37 30 35 37  31 2e 31 20 50 68 79 74  |>MF370571.1 Phyt|
00000010  6f 70 68 74 68 6f 72 61  20 70 61 6c 6d 69 76 6f  |ophthora palmivo|
00000020  72 61 20 69 73 6f 6c 61  74 65 20 46 47 2d 34 31  |ra isolate FG-41|
00000030  20 69 6e 74 65 72 6e 61  6c 20 74 72 61 6e 73 63  | internal transc|
00000040  72 69 62 65 64 20 73 70  61 63 65 72 20 31 2c 20  |ribed spacer 1, |
00000050  70 61 72 74 69 61 6c 20  73 65 71 75 65 6e 63 65  |partial sequence|
00000060  3b 20 35 2e 38 53 20 72  69 62 6f 73 6f 6d 61 6c  |; 5.8S ribosomal|
00000070  20 52 4e 41 20 67 65 6e  65 2c 20 63 6f 6d 70 6c  | RNA gene, compl|
00000080  65 74 65 20 73 65 71 75  65 6e 63 65 3b 20 61 6e  |ete sequence; an|
00000090  64 20 69 6e 74 65 72 6e  61 6c 20 74 72 61 6e 73  |d internal trans|

I was able to get Format of input file '0' not recognized. on Windows within a command prompt session trying to read the FASTA file via stdin under Python 3.12 and the latest cutadapt and dependencies. That clearer error message should help pin this down, I will retest with that.

Your idea about the encoding seems more likely. That would certainly explain the alpha or gamma switch failing (they have plain ASCII alias too, I add the symbolic argument partly to see how well supported it would be - it seemed to be working nicely cross platform but I guess there are still corner cases).

peterjc commented 3 months ago

I first tried installing Python 3.12 from the Microsoft Store, but while pip install cutadapt worked it did not seem to be added to the PATH as a usable command. So I used the installer from python.org and ticked add to PATH, and then pip install gave a usable cutadapt:

C:\Users\xxx>python --version
Python 3.12.4

C:\Users\xxx>pip install -U cutadapt
Collecting cutadapt
  Downloading cutadapt-4.9-cp312-cp312-win_amd64.whl.metadata (3.5 kB)
Collecting dnaio>=1.2.0 (from cutadapt)
  Downloading dnaio-1.2.1-cp312-cp312-win_amd64.whl.metadata (3.6 kB)
Collecting xopen>=1.6.0 (from cutadapt)
  Downloading xopen-2.0.2-py3-none-any.whl.metadata (15 kB)
Collecting isal>=1.6.1 (from xopen>=1.6.0->cutadapt)
  Downloading isal-1.6.1-cp312-cp312-win_amd64.whl.metadata (10 kB)
Collecting zlib-ng>=0.4.1 (from xopen>=1.6.0->cutadapt)
  Downloading zlib_ng-0.4.3-cp312-cp312-win_amd64.whl.metadata (6.9 kB)
Downloading cutadapt-4.9-cp312-cp312-win_amd64.whl (231 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 231.1/231.1 kB 939.6 kB/s eta 0:00:00
Downloading dnaio-1.2.1-cp312-cp312-win_amd64.whl (86 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.2/86.2 kB 808.2 kB/s eta 0:00:00
Downloading xopen-2.0.2-py3-none-any.whl (17 kB)
Downloading isal-1.6.1-cp312-cp312-win_amd64.whl (201 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.5/201.5 kB 942.2 kB/s eta 0:00:00
Downloading zlib_ng-0.4.3-cp312-cp312-win_amd64.whl (88 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.5/88.5 kB 1.7 MB/s eta 0:00:00
Installing collected packages: zlib-ng, isal, xopen, dnaio, cutadapt
Successfully installed cutadapt-4.9 dnaio-1.2.1 isal-1.6.1 xopen-2.0.2 zlib-ng-0.4.3

C:\Users\xxx>cutadapt --version
4.9

Minimal test case using filename as input works:

C:\Users\xxx\Projects>cutadapt --quiet -g GAAGGTGAAGTCGTAACAAGG multiple_hmm.fasta | find ">"
>MF370571.1 Phytophthora palmivora isolate FG-41 internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
>MH169111.1 Phytophthora vignae f. sp. vignae isolate PcV2: internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene and internal transcribed spacer 2, complete sequence; and large subunit ribosomal RNA gene, partial sequence
>DQ641247.1 Phytophthora ramorum isolate 2195 internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence
>MF095142.1 Uncultured Peronosporaceae clone MZOo17 small subunit ribosomal RNA gene, partial sequence; internal transcribed spacer 1 and 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
>KP691407.1 Uncultured Phytophthora clone sp3 18S ribosomal RNA gene and internal transcribed spacer 1, partial sequence

Changed to pipe file via stdin, fails:

C:\Users\xxx\Projects>type multiple_hmm.fasta | cutadapt --quiet -g GAAGGTGAAGTCGTAACAAGG -
Format of input file '0' not recognized.

Updated the latest cutadapt Python files from git (copied over the installed v4.9 ones, I don't have MSVC etc installed):

C:\Users\pc40583\Projects>type c:\Users\pc40583\Projects\multiple_hmm.fasta | cutadapt --quiet -g GAAGGTGAAGTCGTAACAAGG -
Input file format not recognized. The file starts with b' Unc', but files in supported formats start with '>' (FASTA), '@' (FASTQ) or 'BAM'

That seems to be from the second to last sequence in the file (Uncultured Peronosporaceae clone MZOo17), i.e. the start of the stdin appears to get lost somewhere - perhaps linked to why detection is called twice?

C:\Users\pc40583\Projects>type c:\Users\pc40583\Projects\multiple_hmm.fasta | cutadapt --quiet -g GAAGGTGAAGTCGTAACAAGG -
detect_file_format True, 0, b'>MF3'
detect_file_format True, 0, b' Unc'
Input file format not recognized. The file starts with b' Unc', but files in supported formats start with '>' (FASTA), '@' (FASTQ) or 'BAM'

Here I inserted this into the start of the detect_file_format function:

    import sys
    sys.stderr.write(f"detect_file_format {file.seekable()}, {file.tell()}, {file.peek(4)[0:4]}\n")

P.S. This is on Windows 11