thelovelab / tximeta

Transcript quantification import with automatic metadata detection
https://thelovelab.github.io/tximeta/

cached transcript fails to build #56

Closed: JudithR closed this issue 3 years ago

JudithR commented 3 years ago

Hi, I am not sure whether tximeta is to blame, but I have an issue I can't resolve. I hope you might have an idea and can point me in the right direction.

I'm running R 4.0.1 with tximeta 1.8.4 (for the rest, see the attached sessionInfo). I'm using the Parus major transcriptome from Ensembl release 97. I had this as a cached version for quite some time and it worked without issue. Then our server was updated and I had to rebuild everything because file paths changed. Now I can't seem to build the transcriptome; it just hangs. I use the makeLinkedTxome option, as the Parus major genome is not one of the pre-hashed transcriptomes and is no longer in Ensembl.

fastaPath = "~/projects/parus_major/ensembl/Parus_major.Parus_major1.1.cdna.all.fa"
gtfPath = "~/projects/parus_major/ensembl/Parus_major.Parus_major1.1.97.chr.gtf"
indexDir = "~/projects/parus_major/ensembl/pmaj_index"

makeLinkedTxome(indexDir=indexDir, source="Ensembl", organism="Parus major", release="97", genome="pmaj_1.1", fasta=fastaPath, gtf=gtfPath )

I checked the files and locations: they exist and are readable. I rebuilt the salmon index (salmon 1.1.0) to be sure it had not been corrupted.
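(The file check was roughly this; a minimal sketch in base R, using the paths defined above:)

file.exists(c(fastaPath, gtfPath))            # both should be TRUE
file.access(c(fastaPath, gtfPath), mode = 4)  # 0 means readable, -1 means not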

It creates the bfc entry:

writing linkedTxome to ~/projects/parus_major/ensembl/pmaj_index.json
saving linkedTxome in bfc (first time)
bfc <- BiocFileCache()
bfcinfo(bfc)
# A tibble: 1 x 10
  rid   rname  create_time access_time rpath  rtype fpath last_modified_t… etag
  <chr> <chr>  <chr>       <chr>       <chr>  <chr> <chr>            <dbl> <chr>
1 BFC1  linke… 2021-04-12… 2021-04-12… ~/.ca… rela… 9c4c…               NA NA
# … with 1 more variable: expires <dbl>

I know the old one (that stopped working) had 3 entries, not one.

The pmaj_index.json looks good:

[
  {
    "index": "pmaj_index",
    "source": "Ensembl",
    "organism": "Parus major",
    "release": "97",
    "genome": "pmaj_1.1",
    "fasta": ["~/projects/parus_major/ensembl/Parus_major.Parus_major1.1.cdna.all.fa"],
    "gtf": "~/projects/parus_major/ensembl/Parus_major.Parus_major1.1.97.chr.gtf",
    "sha256": "b67a909c7ee0bfe8099bced0f67361732607af315a85fc63f8f0e35effde034f"
  }
]
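(For completeness, the JSON can also be loaded back into the cache directly with tximeta; a sketch, assuming the path above:)

library(tximeta)
loadLinkedTxome("~/projects/parus_major/ensembl/pmaj_index.json")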

The rds file in ~/.cache/BiocFileCache/ shows the gtf file.

read_rds("~/.cache/BiocFileCache/9c83a467d1301_9c83a467d1301.rds")
# A tibble: 1 x 8
  index  source  organism  release genome fasta gtf              sha256
  <chr>  <chr>   <chr>     <chr>   <chr>  <lis> <chr>            <chr>
1 pmaj_… Ensembl Parus ma… 97      pmaj_… <chr… ~/… b67a909c7ee0bf…

When I try to run tximeta, it starts fine. It recognizes the transcriptome and then tries to build the complete cache?

> se_s1 <- tximeta(samples_s1)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
found matching linked transcriptome:
[ Ensembl - Parus major - release 97 ]
useHub=TRUE: checking for EnsDb via 'AnnotationHub'
  |======================================================================| 100%

snapshotDate(): 2020-10-27
found matching EnsDb via 'AnnotationHub'
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
require("ensembldb")

Without hub:

se_s1 <- tximeta(samples_s1, useHub=FALSE)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
found matching linked transcriptome:
[ Ensembl - Parus major - release 97 ]
building EnsDb with 'ensembldb' package
Importing GTF file ... OK

And there it stops, forever and unrecoverably (I have to terminate R using kill).

I used strace to trace the process and I can see that it reads the count files and the GTF file, but nothing else. This is the last thing it did:

641082 lstat("~/.cache/AnnotationHub/9c83a72f2485f_80687", {st_mode=S_IFREG|0644, st_size=120426496, ...}) = 0
641082 lstat("~/.cache/AnnotationHub/9c83a72f2485f_80687", {st_mode=S_IFREG|0644, st_size=120426496, ...}) = 0
641082 getpid()                         = 641082
641082 openat(AT_FDCWD, "~/.cache/AnnotationHub/9c83a72f2485f_80687", O_RDONLY|O_NOFOLLOW|O_CLOEXEC) = 11
641082 fstat(11, {st_mode=S_IFREG|0644, st_size=120426496, ...}) = 0
641082 fstat(11, {st_mode=S_IFREG|0644, st_size=120426496, ...}) = 0
641082 stat("~/.cache/AnnotationHub/9c83a72f2485f_80687", {st_mode=S_IFREG|0644, st_size=120426496, ...}) = 0
641082 lseek(11, 0, SEEK_SET)           = 0
641082 read(11, "SQLite format 3\0\20\0\1\1\0@  \0\0\0\34\0\0r\331"..., 100) = 100
641082 fcntl(11, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}

Am I missing something? I removed the pmaj_index.json, the .cache/tximeta/bfcloc.json, and everything in .cache/BiocFileCache prior to the rerun. I also followed the instructions in the vignette about removing a linkedTxome.
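(The removal I did was along these lines; a sketch, where getTximetaBFC() points at the cache location tximeta uses and the query string is just what I searched for:)

library(BiocFileCache)
bfcloc <- tximeta::getTximetaBFC()   # cache location used by tximeta
bfc <- BiocFileCache(bfcloc)
q <- bfcquery(bfc, "linkedTxome")    # find the cached linkedTxome entry
bfcremove(bfc, q$rid)                # drop it so it gets rebuilt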

session_info.txt

mikelove commented 3 years ago

Hi Judith,

Thanks for the report.

What happens after importing the GTF is that it will try to cache this database in your specified BiocFileCache location.

It may be worth testing if you can do these steps manually.

bfc <- BiocFileCache::BiocFileCache()                              # default cache location
txdb <- GenomicFeatures::makeTxDbFromGFF(...)                      # build the TxDb from your GTF
loc <- BiocFileCache::bfcnew(bfc, rname="testing", ext=".sqlite")  # new file path inside the cache
AnnotationDbi::saveDb(txdb, file=loc)                              # write the TxDb to that file
JudithR commented 3 years ago

Thx for pointing me in the right direction. At least now I know which part fails (assuming you meant GenomicFeatures). I managed to create the database, but it won't write it to the file. However, changing the cache location to /tmp works.

txdb <- makeTxDbFromGFF(file=gtfPath, dataSource="EnsemblDbv97", organism="Parus major", chrominfo=chromInd)
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
> saveDb(txdb, file=loc)
Error: Failed to copy all data:
not an error
In addition: Warning message:
Couldn't set synchronous mode: database is locked
Use `synchronous` = NULL to turn off this warning.

With the /tmp dir:

bfc <- BiocFileCache::BiocFileCache("/tmp/judithr/pmaj_index_cache")
using temporary cache /tmp/Rtmp2O0DyA/BiocFileCache
> loc <- BiocFileCache::bfcnew(bfc, rname="testing", ext=".sqlite")
> saveDb(txdb, file=loc)
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: EnsemblDbv97
# Organism: Parus major
# Taxonomy ID: 9157
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 33036
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2021-04-12 15:14:08 +0200 (Mon, 12 Apr 2021)
# GenomicFeatures version at creation time: 1.42.3
# RSQLite version at creation time: 2.2.6
# DBSCHEMAVERSION: 1.2

Have you encountered something like this before? I think the new file system is to blame, rather than Bioconductor or tximeta.
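(One thing I can still try to separate plain file I/O from SQLite: writing a similarly sized dummy file into the default cache directory; a sketch, with the size picked to roughly match the sqlite file:)

cachedir <- BiocFileCache::bfccache(BiocFileCache::BiocFileCache())
testfile <- file.path(cachedir, "bigwrite_test.bin")
writeBin(raw(120e6), testfile)   # ~120 MB of raw bytes
file.size(testfile)              # should report roughly 120e6
unlink(testfile)                 # clean up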

mikelove commented 3 years ago

This is useful. One possibility is that R doesn't have permission to write to the new BiocFileCache location, or that something else in that step is problematic.

Maybe then try a simple test:

bfc <- BiocFileCache::BiocFileCache()
loc <- BiocFileCache::bfcnew(bfc, rname="testing2")
x <- 1:10
save(x, file=loc)

And then you can figure out why the directory permissions are an issue.
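For example, something like this (a sketch; bfccache() returns the on-disk cache directory) shows what R itself sees for that directory:

cachedir <- BiocFileCache::bfccache(bfc)   # the on-disk cache directory
file.access(cachedir, mode = 2)            # 0 means writable, -1 means not
file.access(cachedir, mode = 1)            # 0 means searchable (execute bit)
Sys.info()[["user"]]                       # the user R is running as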

JudithR commented 3 years ago

That works fine. So does writing the first sqlite entry, just not the second. After I checked out the devel branch, tximeta also returns to the prompt with the same error message as the separate steps. These are the contents of the .cache folder:

drwxr-xr-x 3 me domain users   3 Apr 12 15:49 ..
-rw-r--r-- 1 me domain users 347 Apr 12 15:49 9ea1c4955a5_9ea1c4955a5.rds
-rw-r--r-- 1 me domain users 20K Apr 12 15:50 BiocFileCache.sqlite
-rw-r--r-- 1 me domain users   0 Apr 12 15:50 9ea1c6ea8b017_9ea1c6ea8b017.sqlite
drwxr-xr-x 2 me domain users   5 Apr 12 15:50 .

So it's not that it can't write at all; it just can't write this one file to any location on that file system. Is it possible that the database handle is not released after touching the file? But why it would then work on /tmp is beyond me.
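(To rule out saveDb itself, I can also try writing an SQLite database into that cache directory directly with RSQLite; a sketch, with arbitrary table contents:)

library(DBI)
cachedir <- BiocFileCache::bfccache(BiocFileCache::BiocFileCache())
con <- dbConnect(RSQLite::SQLite(), file.path(cachedir, "direct_test.sqlite"))
dbWriteTable(con, "big", data.frame(x = rnorm(2e6)))   # write a few MB of data
dbDisconnect(con)                                      # release the handle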

mikelove commented 3 years ago

Now that you've narrowed it down to a simple test example, I'd recommend posting to support.bioconductor.org with the BiocFileCache tag, to see if we can get advice from the developers.

So just the case where you try to saveDb to the BFC location and you get the "Failed to copy all data" error.
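Something like this minimal example should be enough to reproduce it (a sketch; saveDb comes from AnnotationDbi, and gtfPath is your GTF from above):

library(BiocFileCache)
library(GenomicFeatures)
bfc <- BiocFileCache()
txdb <- makeTxDbFromGFF(gtfPath)                         # build the TxDb
loc <- bfcnew(bfc, rname="saveDb-test", ext=".sqlite")   # new file in the BFC
AnnotationDbi::saveDb(txdb, file=loc)                    # errors with "Failed to copy all data"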

JudithR commented 3 years ago

Thx for all the help. I posted it, referring to this thread. When copying things across I did see one warning when using tximeta with /tmp that might be relevant (the db disconnect), but I'm not sure where it is generated.

> se_s1 <- tximeta(samples_s1, useHub=FALSE)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
found matching linked transcriptome:
[ Ensembl - Parus major - release 97 ]
building EnsDb with 'ensembldb' package
Importing GTF file ... OK
Processing metadata ... OK
Processing genes ...
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... OK
OK
Processing transcripts ...
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o transcript_biotype ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... OK
Generating index ... OK
  -------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
generating transcript ranges
Warning messages:
1: closing unused connection 3 (ftp://ftp.ensembl.org/pub/release-97/mysql/)
2: In checkAssays2Txps(assays, txps) :

Warning: the annotation is missing some transcripts that were quantified.
1133 out of 27493 txps were missing from GTF/GFF but were in the indexed FASTA.
(This occurs sometimes with Ensembl txps on haplotype chromosomes.)
In order to build a ranged SummarizedExperiment, these txps were removed.
To keep these txps, and to skip adding ranges, use skipMeta=TRUE

Example missing txps: [ENSPMJT00000000007, ENSPMJT00000000011, ENSPMJT00000000017, ...]

3: call dbDisconnect() when finished working with a connection
mikelove commented 3 years ago

The warning about the annotation missing some txps and the warning about dbDisconnect are both OK (I get those on my end as well).
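(If you ever did want to keep those missing txps, the warning's own suggestion applies; a sketch, which returns an unranged SummarizedExperiment:)

se_s1_unranged <- tximeta(samples_s1, useHub=FALSE, skipMeta=TRUE)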

JudithR commented 3 years ago

Hi Mike,

Thx for all the help. The IT department figured it out. The firewall of the new storage server caused issues with writing larger blocks to disk.

mikelove commented 3 years ago

Thanks for the reports!