pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars, 274 forks

Pysam returning libcurl error 77 when accessing public S3 file? #1257

Open daisieh opened 8 months ago

daisieh commented 8 months ago

This started happening for us in pysam 0.22.0; it doesn't happen in 0.21.0.

In case it helps, here is pip list and a quick example of the error:

root@ef448fe76fd3:/app/htsget_server# pip list
Package                   Version
------------------------- ----------
attrs                     23.1.0
candigv2-authx            1.0.0
certifi                   2023.11.17
charset-normalizer        3.3.2
click                     8.1.7
clickclick                20.10.2
connexion                 2.14.1
exceptiongroup            1.2.0
Flask                     2.2.5
Flask-Cors                3.0.10
greenlet                  3.0.2
idna                      3.6
inflection                0.5.1
iniconfig                 2.0.0
itsdangerous              2.1.2
Jinja2                    3.1.2
jsonschema                4.20.0
jsonschema-specifications 2023.11.2
MarkupSafe                2.1.1
minio                     7.1.14
packaging                 23.2
pip                       23.0.1
pluggy                    1.3.0
psycopg2-binary           2.9.9
pysam                     0.22.0
pytest                    7.2.0
PyYAML                    6.0.1
referencing               0.32.0
requests                  2.31.0
rpds-py                   0.15.2
setuptools                65.5.1
six                       1.16.0
SQLAlchemy                1.4.44
swagger-ui-bundle         0.0.9
tomli                     2.0.1
urllib3                   2.1.0
uWSGI                     2.0.23
Werkzeug                  2.3.8
wheel                     0.42.0

[notice] A new release of pip is available: 23.0.1 -> 23.3.2
[notice] To update, run: pip install --upgrade pip
root@ef448fe76fd3:/app/htsget_server# python
Python 3.10.13 (main, Dec 19 2023, 20:49:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pysam
>>> x = pysam.VariantFile("https://1000genomes.s3.us-east-1.amazonaws.com/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")
[E::easy_errno] Libcurl reported error 77 (Problem with the SSL CA cert (path? access rights?))
[E::hts_open_format] Failed to open file "https://1000genomes.s3.us-east-1.amazonaws.com/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" : Input/output error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pysam/libcbcf.pyx", line 4117, in pysam.libcbcf.VariantFile.__init__
  File "pysam/libcbcf.pyx", line 4342, in pysam.libcbcf.VariantFile.open
OSError: [Errno 5] could not open variant file `b'https://1000genomes.s3.us-east-1.amazonaws.com/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'`: Input/output error

The main error seems to be

[E::easy_errno] Libcurl reported error 77 (Problem with the SSL CA cert (path? access rights?))

But it seems odd that this error doesn't happen with pysam 0.21.0 or lower...

daisieh commented 7 months ago

It's possible that this is a Docker thing...?

litaifang commented 5 months ago

I created a duplicate issue several weeks ago. This happens with other remote locations too (google cloud and https bam files): https://github.com/pysam-developers/pysam/issues/1268

jmarshall commented 3 months ago

Thanks for the report. Increasing the verbosity helps identify the problem:

>>> import pysam
>>> pysam.set_verbosity(9)
3
>>> pysam.AlignmentFile('s3://example/foo.bam')
[…]
*   Trying 3.5.7.133...
* TCP_NODELAY set
* Connected to example.s3.amazonaws.com (3.5.7.133) port 443 (#0)
* ALPN, offering http/1.1
* error setting certificate verify locations:
  CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
[E::easy_errno] Libcurl reported error 77 (Problem with the SSL CA cert (path? access rights?))
[E::hts_open_format] Failed to open file "s3://example/foo.bam" : Input/output error
[…]

This /etc/pki/tls/certs/ca-bundle.crt path is RedHat/CentOS/Fedora's convention for the CAfile. You are probably running on Debian or Ubuntu, where the conventional path is /etc/ssl/certs/ca-certificates.crt and the path it's looking for does not exist.

You can work around this by exporting CURL_CA_BUNDLE so that pysam's libcurl will look for these files in the right place:

export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
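
If exporting the variable in the shell is awkward (for example inside a container entrypoint), the same workaround can be sketched from Python. This is a minimal sketch, not part of pysam: `use_ca_bundle` is a hypothetical helper, and it assumes the variable is consulted when a remote file is opened, so it has to run before the first `pysam.VariantFile` or `pysam.AlignmentFile` call on a remote URL.

```python
import os

# Debian/Ubuntu convention, as given in the comment above.
DEBIAN_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def use_ca_bundle(path=DEBIAN_CA_BUNDLE, environ=os.environ):
    """Point CURL_CA_BUNDLE at `path` if that file exists.

    Hypothetical helper (not a pysam API). Returns True if the variable
    was set, False if the file is missing and nothing was changed.
    """
    if not os.path.isfile(path):
        return False
    environ["CURL_CA_BUNDLE"] = path
    return True


if __name__ == "__main__":
    # Call this before opening any remote file with pysam.
    if use_ca_bundle():
        print("CURL_CA_BUNDLE =", os.environ["CURL_CA_BUNDLE"])
    else:
        print("CA bundle not found; environment left unchanged")
```

Checking that the file exists first avoids swapping one missing path for another on a distribution that uses a different layout.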

jmarshall commented 3 months ago

As for why this has started happening in 0.22.0, it has to do with how wheels are built. The problem is in the system libraries that are distributed inside manylinux wheels, so this is related to #1097 and #1276.

The pysam-0.21.0 Linux wheels were built for manylinux_2_24, which is now EOL and was based on Debian 9. These wheels contain a copy of libcurl-9f97daa0.so.4.4.0, which, as it was built on Debian, defaults to the /etc/ssl/… path.

More recent pysam releases' Linux wheels have been built for manylinux_2_28, which is based on AlmaLinux 8. These wheels contain a copy of libcurl-14d1b62d.so.4.5.0, which, as it was built on a Red Hat-compatible distribution, defaults to the /etc/pki/… path.

Fedora (40, at least) contains symlinks under /etc so that both styles of path point to the same certificate bundle, so it works with both flavours of wheel. (But e.g. Rocky and Alma do not; see also this bug.) Debian and Ubuntu have no such symlinks, so only the wheel containing the Debian-style libcurl.so will work there (without assistance from the environment variable).

This would appear to be a limitation in manylinux's claim to be making wheels that are portable across distributions!

This can be worked around by having everyone set $CURL_CA_BUNDLE as appropriate, but that is less than ideal. Ways of dealing with this when building future wheels would include:

  1. Manylinux may find a way to fix the libcurl.so that they ship.
  2. Because manylinux_2_24 is EOL, reverting to building that flavour of wheel is a non-starter.
  3. Pysam could patch its copy of _hfilelibcurl.c to detect what paths are available at runtime and set CURLOPT_CAINFO accordingly, so that it would automatically work with whichever path style was present.
  4. The real problem here is the large number of system libraries that get pulled into our manylinux wheels.

    If we omitted the plugins from pysam wheels, libcurl.so and many other libraries would not be pulled into our wheels, and this and the two issues mentioned above would be fixed at a stroke. These plugins should not really be shipped within the Python world at all; it would be better if pysam could access externally-provided (non-Pythonised) plugin object files, via $HTS_PATH if necessary. But transitioning to that model would be a non-trivial deployment problem.

Long-term the correct approach is (4), as it solves numerous problems: it fixes these three issues and also reduces the size of our wheels. It may be worth doing (3) too in the interim.
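
Option (3) would be implemented in C inside _hfilelibcurl.c, setting CURLOPT_CAINFO directly. Until then, the same detection idea can be approximated on the user side in Python. The sketch below is an illustration only: `find_ca_bundle` and `ensure_ca_bundle` are hypothetical names, not pysam functions, and the candidate list contains just the two path conventions mentioned in this thread, so it is not exhaustive.

```python
import os

# The two conventions named in this thread (illustrative, not exhaustive).
CANDIDATE_CA_BUNDLES = (
    "/etc/ssl/certs/ca-certificates.crt",  # Debian/Ubuntu
    "/etc/pki/tls/certs/ca-bundle.crt",    # Red Hat/CentOS/Fedora
)


def find_ca_bundle(candidates=CANDIDATE_CA_BUNDLES):
    """Return the first candidate that is a readable file, or None."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.R_OK):
            return path
    return None


def ensure_ca_bundle(environ=os.environ, candidates=CANDIDATE_CA_BUNDLES):
    """Set CURL_CA_BUNDLE unless the user has already chosen one.

    Returns the path now in effect, or None if no candidate was found.
    """
    existing = environ.get("CURL_CA_BUNDLE")
    if existing:
        return existing  # respect an explicit user setting
    path = find_ca_bundle(candidates)
    if path is not None:
        environ["CURL_CA_BUNDLE"] = path
    return path
```

Calling `ensure_ca_bundle()` once at program start, before any remote file is opened, lets the same code run on either family of distribution. An explicit CURL_CA_BUNDLE set by the user is left alone.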

daisieh commented 3 months ago

From my POV, the fix suggested works for us! Thank you so much, and if you all feel that the root issue should be carried on elsewhere instead of in here, feel free to close.

jmarshall commented 3 months ago

Glad to hear it does the trick.

Let's keep this one open to represent the interim fix (3), and in due course I'll open another issue to represent (4).