lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/
123 stars 50 forks source link

`R CMD check` failure on Linux ARM64 #595

Closed martin-g closed 10 months ago

martin-g commented 10 months ago

Hello,

As you may know the Bioconductor project started running the build and checks on Linux ARM64 in current devel (3.18). There is a test failure that I need help to debug it:

* checking for unstated dependencies in ‘tests’ ... OK
* checking tests ...
  Running ‘testthat.R’
 ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  Error: Can not open file /home/biocbuild/.cache/R/rpx/3322c91fc0b96b_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML! Original error was: Error: [SpectrumList_mzML::create()] Bad istream.

  Backtrace:
      ▆
   1. └─mzR::openMSfile(f) at test_MSmap.R:8:0
   2.   └─base::tryCatch(...)
   3.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
   4.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
   5.         └─value[[3L]](cond)

  [ FAIL 1 | WARN 1092 | SKIP 12 | PASS 2372 ]
  Error: Test failures
  In addition: Warning message:
  closing unused connection 4 (/home/biocbuild/tmp/RtmpDjrNFD/file234af3ae22f9f) 
  Execution halted
* checking for unstated dependencies in vignettes ... OK

Any hints what to do to resolve the problem ? I maintain the Linux ARM64 machine used as a builder, so it is easy for me to try any suggestions!

Thank you!

martin-g commented 10 months ago
biocbuild@kunpeng2 ~/git> ll /home/biocbuild/.cache/R/rpx/3322c91fc0b96b_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-rw-r--r-- 1 biocbuild biocbuild 137M Jun  7 14:43 /home/biocbuild/.cache/R/rpx/3322c91fc0b96b_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
biocbuild@kunpeng2 ~/git> head /home/biocbuild/.cache/R/rpx/3322c91fc0b96b_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
<?xml version="1.0" encoding="ISO-8859-1"?>
<indexedmzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.2_idx.xsd">
  <mzML xmlns="http://psi.hupo.org/ms/mzml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://psi.hupo.org/ms/mzml http://psidev.info/files/ms/mzML/xsd/mzML1.1.0.xsd" id="TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01" version="1.1.0">
    <cvList count="2">
      <cv id="MS" fullName="Proteomics Standards Initiative Mass Spectrometry Ontology" version="3.30.0" URI="http://psidev.cvs.sourceforge.net/*checkout*/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo"/>
      <cv id="UO" fullName="Unit Ontology" version="12:10:2011" URI="http://obo.cvs.sourceforge.net/*checkout*/obo/obo/ontology/phenotype/unit.obo"/>
    </cvList>
    <fileDescription>
      <fileContent>
        <cvParam cvRef="MS" accession="MS:1000579" name="MS1 spectrum" value=""/>

The first 10 lines of the file look OK!

But it seems it is incomplete. It ends with:

...
   </binaryDataArray>
            <binaryDataArray encodedLength="86052">
              <cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" value=""/>
              <cvParam cvRef="MS" accession="MS:1000574" name="zlib compression" value=""/>
              <cvParam cvRef="MS" accession="MS:1000515" name="intensity array" value="" unitCvRef="MS" unitAccession="MS:1000131" unitName="number of counts"/>
              <binary>eJzsnWV0FdvSrhfuEjR4k+DuLo07BBLcGoIF1+DQuAR3p3F394bg7i5NcAhuG8291LP2GYdz+fiAPX/cce/Xf3qsrJWWOWuWvPVWTZfrHx9OkiT39P991js2/vjtbF598/Xb2bUq11v5/DjkjP7Pb+OyvSa8kus8juwq8e1zvEVyNh96fPj2d2tfjscq7mPdePXk23WMaC3l+k75QuG/nfVbkd7++/X1HpXeyee3pSN/+956WDKa/G7Q1Ajfztqarw++fa/pn5/LeceBBPJ95XZe387Gtcdp5PPkSynlPolKR5HPBxPJ+7hGzAwn101xlPdK0363PFeaM/JcRv19ceT7+l5ct2yLeN/OrhrjuE6H8G9UjIc5u+BruW/TknIf84XmLc/baG96eY7pdeQ9jGfLEsnzeG9hfp4/l/HRKq19ouI57DGxrny7jnV0fFQZ37Cecj8nR6kk8t57jsj7a3ULyf31tJ8v/ui+TuzKMi96uF0ip9a2KXI9p/KzuPJ/KfYll/fc0C+zvJfP3Lxy3TjFC8p9vhSRz/osPYP83Se7zIP+oNxNFe9p7Zshz2fXjYEcrAkncmCczcV7RRivZF7tscevfruOWW+vzJM1cX5OuU/zCDVkPOZk8Jd5HHK7NetgipzNeGZT+Tzjoo/8vvDjbPK7nKnef7uOk7XJ3W9n3Wws69684PWXPG/FFfL8Lr/a8j6OdzEZN/O8K7WMY4tt2eV6mVfmk79HTFRA/l4xRy45B4yS39mJ98m82RMWXlMxDtY0X5FPR8sTS8b38VG5v3XhVRV5jovp6sjzFg2uJ99PO+wn73F2Vhn5Pv+5TKzTyfL/rjWL9ny7nvFBk3XonKqSWH43bn8qeb9hjTX5Xe3SnvJ/BVfHlN9Nii3j4TqQKqHc590EkW/XhH0il/btFzK+drz4St7bzPfqoFznxRaZD+3CPvSHTz55Lqdbe3kOK3rm6PI8vdbJPNrz07xScv8J3nIdO1k/rttnadi3s76wFHISu6DIv9b5WKiK+1ldcosetVqPFP3s8i0o+sw6chw9VqaFrGtjrFFIxvueKXLgDI+RUd7/OvrdjjnvkpLnOVDx/rfraNGzR5T7pr8RQ+Tk8wPRR3bY0y/yfZJO95Xc73jxZ9+uoxcc80nOeSPLvLuWl5T1pH+JI/Ngrn8g+sXwP/9cxX3Nv56J/bA/dRU76srvHZVx3x5JPt/5iJ4bFnLi3+/nnHsh72299BD5NwqEE7vpeBoZ0TeR5exaUlf0ghP1XWz53WdLyXg5TU8+/XYdo/gD9NXGWLKezWFF5ayN7CvPb3onEHkyY6YU+TWDVok+1VcaL3/pOVbr87/9zq4eHf/iRlyZd5

Do you have an idea why/when this might happen ?

jorainer commented 10 months ago

Hi @martin-g ! Are there any issues or problems with the mzR package on ARM64? The original error seems to be coming from the proteowizard library (C++ code) that is bundled/shipped with mzR. MSnbase (and in fact also other packages like Spectra etc) are using mzR to read/import mzML data through that proteowizard library.

martin-g commented 10 months ago

Hi @jorainer !

Spectra works fine - https://bioconductor.org/checkResults/3.18/bioc-LATEST/Spectra/ mzR has some check warnings but again no errors - https://bioconductor.org/checkResults/3.18/bioc-LATEST/mzR/

I have the feeling MSnbase fails due to a timeout while downloading the big resource file. There is nothing in the logs about timeout but this is the only explanation I have at the moment why the XML file is incomplete. Some other Bioc packages had timeout problems when downloading from non-Bioc websites. Bioc uses geo CDN and usually the connections are stable and fast. But if a package needs to download a big file from a third-party website then it might be the reason!

lgatto commented 10 months ago

Hi - as far as I remember, MSnbase doesn't download any large files.

Largish files that would be used (mzML, as shown above) would either be taken from the msdata package or downloaded/cached using the rpx package.

martin-g commented 10 months ago

I deleted the files in ~/.cache/R/rpx/ and ran R CMD check MSnbase_2.27.1.tar.gz again. The new files in this folder are:

ll ~/.cache/R/rpx/
total 1.2G
-rw-r--r-- 1 biocbuild biocbuild 2.1K Aug 24 10:39 1774ba70337d4a_1774ba70337d4a.rds
-rw-r--r-- 1 biocbuild biocbuild 2.5K Aug 24 10:40 1774baef7ad97_1774baef7ad97.rds
-rw-r--r-- 1 biocbuild biocbuild 385M Aug 23 12:05 39716e4e7c52da_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-rw-r--r-- 1 biocbuild biocbuild 365M Aug 22 23:13 3bcc8d23cbfaeb_QEP2LC6_HeLa_50ng_251120_02-calib.mzML
-rw-r--r-- 1 biocbuild biocbuild 387M Aug 22 23:07 3bcc8d59b2f2f6_QEP2LC6_HeLa_50ng_251120_01-calib.mzML
-rw-r--r-- 1 biocbuild biocbuild 3.3M Aug 23 08:41 3ead33595bf906_QEP2LC6_HeLa_50ng_251120_02-calib.mzID.gz
-rw-r--r-- 1 biocbuild biocbuild 3.3M Aug 23 08:38 3ead33767a46ae_QEP2LC6_HeLa_50ng_251120_01-calib.mzID.gz
-rw-r--r-- 1 biocbuild biocbuild 1.6M Aug 23 00:04 3edeb6242c5770_erwinia_carotovora.fasta
-rw-r--r-- 1 biocbuild biocbuild 298K Aug 23 00:03 3edeb626cee4cd_F063721.dat-mztab.txt
-rw-r--r-- 1 biocbuild biocbuild  20K Aug 24 10:40 BiocFileCache.sqlite
-rw------- 1 biocbuild biocbuild    0 Aug 22 12:46 BiocFileCache.sqlite.LOCK

Only 39716e4e7c52da_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML has a problem - again the XML is not properly finished:

...
</scanList>
          <precursorList count="1">
            <precursor spectrumRef="controllerType=0 controllerNumber=1 scan=6327">
              <isolationWindow>
                <cvParam cvRef="MS" accession="MS:1000827" name="isolation window target m/z" value="949.11" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
                <cvParam cvRef="MS" accession="MS:1000828" name="isolation window lower offset" value="1.0" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
                <cvParam cvRef="MS" accession="MS:1000829" name="isolation window upper offset" value="1.0" unitC

I will check rpx's sources to see where this file is downloaded from !

martin-g commented 10 months ago

Both http://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/ and http://ftp.pride.ebi.ac.uk/pride-archive/2012/03/PXD000001/ provide TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML and its size is 429M, while in ~/.cache/R/rpx it was 385M.

As far as I understand the logic that downloads the files is at:

./rpx/R/px_remote.R:3:    valid <- try(RCurl::getURL(paste0(url, "/"), dirlistonly = TRUE),
./rpx/R/px.R:220:                  ans <- strsplit(getURL(ftpdir, dirlistonly = TRUE), "\n")[[1]]

I am going to try with a bigger connectiontimeout value for RCurl:

library(RCurl)
> getCurlOptionsConstants()[["connecttimeout"]]
[1] 78

I will add custom options, e.g. something like:

myOpts <- curlOptions(connecttimeout = 300)
urlCSV <- getURL(url, .opts = myOpts)

https://curl.se/libcurl/c/CURLOPT_CONNECTTIMEOUT.html says that the default is 300 seconds. I am not sure why RCurl uses 78

martin-g commented 10 months ago

It seems those getURL calls are only to get the file names. The real work is done by BiocFileCache. I'm still looking which timeout setting could be tuned!

lgatto commented 10 months ago

Yes, all the work is done via BiocFileCache.

Note that rpx gets its files from RPIDE (at the EBI), and that there have been many issues in the past. But these should have been solved by now.

That file gets downloaded from the PRIDE archive at https://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/

martin-g commented 10 months ago

I have some progress! I made the following change in rpx and installed it locally:

biocbuild@kunpeng2 ~/g/rpx (devel)> git diff
diff --git a/R/cache.R b/R/cache.R
index 52ed49f..b805bda 100644
--- a/R/cache.R
+++ b/R/cache.R
@@ -168,7 +168,7 @@ allPXD <- function(cache = rpxCache()) {
         return(.read_and_parse_sitemap(rid$rpath))
     } else {
         ## update and return
-        bfcdownload(cache, rid$rid, ask = FALSE)
+        bfcdownload(cache, rid$rid, ask = FALSE, config = timeout(3000))
         rid <- bfcquery(cache, url)
         return(.read_and_parse_sitemap(rid$rpath))
     }

With timeout=3000 it downloaded 438689792 bytes out of 450032788 ! And failed with:

* checking tests ...
  Running ‘testthat.R’
 ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
      ▆
   1. ├─rpx::pxget(px1, "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML") at test_MSmap.R:7:0
   2. └─rpx::pxget(px1, "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML")
   3.   └─rpx (local) .local(object, ...)
   4.     └─rpx:::pxget1(uris[i], cache)
   5.       ├─BiocFileCache::bfcadd(cache, url, url)
   6.       └─BiocFileCache::bfcadd(cache, url, url)
   7.         └─BiocFileCache:::.util_download(...)

  [ FAIL 1 | WARN 1094 | SKIP 12 | PASS 2372 ]
  Error: Test failures
  In addition: Warning message:
  In for (i in seq_len(n)) { :
    closing unused connection 4 (/home/biocbuild/tmp/RtmpR5gFJT/file28d0851eee8e58)
  Execution halted

Then I thought: let's download the file on my dev machine (I am based in Bulgaria, Europe) and copy it (via scp/ssh) to the Bioc builder (based in Singapore) - ~/.cache/R/rpx/28d0855d003fd6_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML. But the next R CMD check MSnbase_2.27.1.tar.gz does not reuse the file but downloads a new one - 45efc6d4f69de_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML, note the leading 45efc6d4f69de_! So now I bumped the timeout to 30000 and I hope that it will download it completely this time!

The speed to https://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001/ is slow indeed! I guess the other Bioc builder nodes have downloaded the files once and now they reuse it from the cache and that's why they don't have this problem ?!

lgatto commented 10 months ago
martin-g commented 10 months ago

I see that the mapping from url to cached resource is in ~/.cache/R/rpx/BiocFileCache.sqlite. As a final resort I may add a record for the scp copied file manually to the resource table!

martin-g commented 10 months ago

No need to ask the EBI people for now! I think I better understand how it works and soon the test should pass! I will document the manual setup for the next time I need it and will revert the local change in rpx package.

lgatto commented 10 months ago

Yes, it's the rpx package that downloads/caches/handles these files using BiocFileCache.

lgatto commented 10 months ago

Thank you for all your work!

martin-g commented 10 months ago
...
* checking tests ...
  Running ‘testthat.R’
 OK
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in ‘inst/doc’ ... OK
* checking running R code from vignettes ...
  ‘v01-MSnbase-demo.Rmd’ using ‘UTF-8’... OK
  ‘v02-MSnbase-io.Rmd’ using ‘UTF-8’... OK
  ‘v03-MSnbase-centroiding.Rmd’ using ‘UTF-8’... OK
  ‘v04-benchmarking.Rmd’ using ‘UTF-8’... OK
  ‘v05-MSnbase-development.Rmd’ using ‘UTF-8’... OK
 NONE
* checking re-building of vignette outputs ... OK
* checking PDF version of manual ... OK
* DONE

Status: 3 NOTEs
See
  ‘/home/biocbuild/git/MSnbase.Rcheck/00check.log’
for details.

Hooray! The workaround was to add a record in ~/.cache/R/rpx/BiocFileCache.sqlite! The next official build on Linux ARM64 should be green for MSnbase too !

Thank you for the support, @lgatto !

lgatto commented 10 months ago

Hooray! Thank you for your work!