kingsfordgroup / sailfish

Rapid Mapping-based Isoform Quantification from RNA-Seq Reads

http://www.cs.cmu.edu/~ckingsf/software/sailfish

GNU General Public License v3.0

124 stars 45 forks source link

Error with Sailfish version Sailfish-0.6.3-Linux_x86-64 while running the index on NCBI_spiked in transcriptome #79

Closed vd4mmind closed 9 years ago

vd4mmind commented 9 years ago

Hi,

I downloaded the binary version of the Sailfish-0.6.3-Linux_x86-64 from the website and after unpacking it I have already followed the instruction of exporting the path in my .bashrc file

My bashrch looks like below export LD_LIBRARY_PATH=/scratch/GT/softwares/Sailfish-0.6.3-Linux_x86-64/lib:$LD_LIBRARY_PATH export PATH=$PATH:/scratch/GT/softwares/Sailfish-0.6.3-Linux_x86-64/bin:$PATH

So when I invoke the sailfish command it works ,

However when am running the index command on my transcriptome file which is NCBI_spiked GRCh38 I am getting some error. sailfish index -t /scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/Chromosomes_clean -o /scratch/GT/vdas/benchmark/sailfishindex -k 20

Error Version Info: This is the most recent version writing logs to /scratch/GT/vdas/benchmark/sailfishindex/logs readFile: /scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/Chromosomes_clean, file /scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/Chromosomes_clean: reading from 1 streams reading from /scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/Chromosomes_clean Running Jellyfish on transcripts Exiting, log location: /scratch/GT/vdas/benchmark/sailfishindex/logs/sailfish.g2log.20150728-161609.log Exception : [boost::filesystem::file_size: Operation not permitted: "/scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/Chromosomes_clean"] sailfish index was invoked improperly. For usage information, try sailfish index --help Exiting.

Can you please tell me where am getting wrong? It seems are problem of the boost , is it? But for the binary I do not have to install the boost, let me know how can I fix this issue?

vd4mmind commented 9 years ago

I think I have understood where the issue lies, I was not providing the transcriptome .fa file, it is working, now my commnad is

sailfish index -t /scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/RSEM3/transcriptome.idx.fa -o /scratch/GT/vdas/benchmark/sailfishindex -k 20 -p 16

It ran and the log file seem to be fine except at the end I get the below warning

reading from 1 streams reading from /scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/RSEM3/transcriptome.idx.fa Number of kmers : 72798395 Parsing transcripts and building k-mer equivalence classes hit eof! ] ETA > 1 week
100% [=====================================================] in 32s
after relabling, there are 276081 eq classes Exiting, log location: /scratch/GT/vdas/benchmark/sailfishindex/logs/sailfish.g2log.20150728-163347.log Exception : [locale::facet::_S_create_c_locale name not valid] sailfish index was invoked improperly. For usage information, try sailfish index --help

``` Exiting.

Can you tell me is it fine or not? The output of this run is listed below

cd /scratch/GT/vdas/benchmark/sailfishindex ls -ltr drwxr-xr-x 2 vdas DPT 5 Jul 28 16:33 logs -rw-r--r-- 1 vdas DPT 4159762 Jul 28 16:33 bias_feats.txt -rw-r--r-- 1 vdas DPT 655186275 Jul 28 16:34 jf.counts_0 -rw-r--r-- 1 vdas DPT 649543742 Jul 28 16:36 transcriptome.sfi -rw-r--r-- 1 vdas DPT 291193596 Jul 28 16:36 transcriptome.sfc -rw-r--r-- 1 vdas DPT 1308438 Jul 28 16:36 transcriptome.tgm

Can you tell me if this is fine or not? Am particularly bothered about the below warning Exception : [locale::facet::_S_create_c_locale name not valid] sailfish index was invoked improperly. For usage information, try sailfish index --help Exiting.

vd4mmind commented 9 years ago

having the above output I started runnnig sailfish quant with the below command but it stopped after a while with the below error

ERROR: sailfish count (subordinate command) invoked improperly. Sailfish terminated with return code 256 sailfish: /home/robp/Sailfish-0.6.3-Source/src/QuantificationDriver.cpp:152: int runKmerCounter(const string&, uint32_t, const string&, const std::vector<ReadLibrary>&, const string&, bool): Assertionstatus == 0' failed. Aborted`

command used:

sailfish quant -i /scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/sailfishIndexes -l "T=PE:O=><:S=U" -1 <(zcat /data/GT/pgermain/wbs/rnaseq/raw/lieber.dec2013/AJ80/*R1*.fastq.gz) -2 <(zcat /data/GT/pgermain/wbs/rnaseq/raw/lieber.dec2013/AJ80/*R2*.fastq.gz) -o /scratch/GT/benchmark/QT/sailfish/AJ80 -p 8

Can you tell me where am going wrong? it seems to be a problem of the QuantificationDriver.cpp program. Am using the binary version and not the source code. Please let me know.

rob-p commented 9 years ago

The exception you encountered is due to a known issue with Boost and certain OS setups. It's a real exception and likely means that the index didn't finish building properly. One work around is to try and "force" the expected locale when running the program; e.g.:

$ LC_ALL=C sailfish index -t /scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/RSEM3/transcriptome.idx.fa -o /scratch/GT/vdas/benchmark/sailfishindex -k 20 -p 16

Adding the LC_ALL=C before the executable should tell it to use the expected locale. However, I've never run into this particular bug so I'm not certain this will fix it.

On an unrelated note, I highly encourage you to take a look at salmon --- pre-print available here --- which is our new and improved method for transcript quantification using lightweight alignment. The speed is comparable to Sailfish (sometimes faster), but we've found it to be more accurate especially in difficult scenarios with complex mixtures of isoforms.

vd4mmind commented 9 years ago

Thanks a lot rob-p, I have already re-ran the index forcing the LC_ALL=C and it did finish the indexing successfully. Now am using the below command of the sailfish quant to run a test on one of my samples but it throws a error at the end. Below is the command and the error.

Error: Overall rate: 123832 reads / s 1016.855500s wall, 8175.680000s user + 182.370000s system = 8358.050000s CPU (822.0%) works well till here but below gets error again ERROR: sailfish count (subordinate command) invoked improperly. Sailfish terminated with return code 256 sailfish: /home/robp/Sailfish-0.6.3-Source/src/QuantificationDriver.cpp:152: int runKmerCounter(const string&, uint32_t, const string&, const std::vector<ReadLibrary>&, const string&, bool): Assertionstatus == 0' failed. Aborted`

Can you give me a workaround as to how to resolve this? I am running the binary version and not the source so I was expecting that there would not have been any issue, I am running the code in our Institute cluster, so memory issue should not be a problem. It seems to be a problem of one of the code of QuantificationDriver, where it is trying to fetch the string for the library type when it encounters int runKmerCounter, I would appreciate if you can give me an idea to make this quant work with Sailfish.

I am already working with Salmon as well, I just want to see the outcome of Sailfish and its difference with Salmon.

vd4mmind commented 9 years ago

HI @rob-p ,

I have infact downloaded the binary version from www.cs.cmu.edu/~ckingsf/sailfish. I just unpacked it and exported the path as mentioned and ran the index and quant. The index worked as you told but as you can see still the error persist with the library type, so do you recommend me to change the handle of -l like -l "IU" and then run it? Since I believe am using the latest binary version but it has not been updated with the new library type, I am still giving it a try. Would appreciate your inputs and sorry for too much of commenting.

vd4mmind commented 9 years ago

So here I would again like to barge in stating that I could ran it without error and below is the entire output log.

Command Used

LC_ALL=C sailfish quant -i /scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/sailfishIndexes -l "T=PE:O=><:S=U" -1 <(zcat /data/GT/pgermain/wbs/rnaseq/raw/lieber.dec2013/AJ80/R1.fastq.gz) -2 <(zcat /data/GT/pgermain/wbs/rnaseq/raw/lieber.dec2013/AJ80/R2.fastq.gz) -o /scratch/GT/benchmark/QT/sailfish/AJ80 -p 8

Run log

Version Info: This is the most recent version Library format { type:paired end, relative orientation:toward, strandedness:unstranded } reading index . . . done index contained 72798395 kmers file /dev/fd/63: reading from 1 streams reading from /dev/fd/63 hit eof!d 60500000 reads (124485) reads/s processed 62500000 reads (124501) reads/s 503.320817s wall, 4069.240000s user + 87.630000s system = 4156.870000s CPU (825.9%) file /dev/fd/62: reading from 1 streams reading from /dev/fd/62 hit eof!d 123250000 reads (122902) reads/s processed 125250000 reads (122967) reads/s 510.194351s wall, 4129.560000s user + 84.260000s system = 4213.820000s CPU (825.9%) Overall rate: 123588 reads / s There were 10146324913, kmers; 3148434897 could not be mapped Mapped 68.9697% of the kmers Total counting time [including file i/o]: 1018 seconds. 1018.175752s wall, 8199.410000s user + 172.220000s system = 8371.630000s CPU (822.2%) Sailfish terminated with return code 0 In Sailfish estimation thread. Estimating transcript abundances writing logs to /scratch/GT/benchmark/QT/sailfish/AJ80/logs Reading the transcript <-> gene map from [/scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/sailfishIndexes/transcriptome.tgm] done Reading transcript index from [/scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/sailfishIndexes/transcriptome.sfi] . . .done Reading read counts from [/scratch/GT/benchmark/QT/sailfish/AJ80/reads.sfc] . . .read length = 12531860200, numLengths = 125318602 done Creating optimizer . . .done optimizing using iterative optimization [1000] iterations reading Kmer equivalence classes updating Kmer group counts updating transcript map done Transcript LUT contained 50036 records Removing duplicates from kmer transcript lists ... done Computing kmer group promiscuity rates done Computing initial coverage estimates ... There were 36593 uniquely anchored transcripts There were 50036 transcripts with at least one overlapping k-mer done SQUAREM iteraton [98] 1/3 100% [=====================================================] in 0s max relative change: 0.00430594 convergence criteria met; terminating SQUAREM Writing output=============================================] in 0s 100% [=====================================================] in 0s biasFeatPath = "/scratch/GT/genomes/Homo_sapiens/NCBI_spiked/GRCh38/Sequence/sailfishIndexes/bias_feats.txt" expressionFilePath = "/scratch/GT/benchmark/QT/sailfish/AJ80/quant.sf" biasCorrectedFile = "/scratch/GT/benchmark/QT/sailfish/AJ80/quant_bias_corrected.sf" parsed 50036 features parsed 50036 expression values 437082 50430.3 7487.36 2454.25 2097.55 1197.22 963.691 772.566 599.302 540.209 507.953 338.099 301.399 0.0443967 0.028502 0.014292 0.00367797 totalVariance: 504772 ev: 437082 ev: 50430.3 95% of the variance is explained by 2 dimensions there are 36164 samples there are 3 features Random forest progress: 500/500 Out-of-bag score(RMSE,Correlation Coefficient)=(1.990085,0.023764) |Random Forest training done. | Using time: 3 secs| Train RMSE=1.98452, Correlation Coefficient=0.0298324 Train RMSE=0.295303, Correlation Coefficient=0.978518 retainedCnt = 36164, nsamps = 36164

Output Obtained

ls -ltr total 170650 -rw-r--r-- 1 vdas DPT 291193596 Jul 29 10:28 reads.sfc -rw-r--r-- 1 vdas DPT 84 Jul 29 10:28 reads.count_info drwxr-xr-x 2 vdas DPT 3 Jul 29 10:28 logs -rw-r--r-- 1 vdas DPT 2426082 Jul 29 10:28 quant.sf -rw-r--r-- 1 vdas DPT 2421581 Jul 29 10:28 quant_bias_corrected.sf

Can you please tell me if I need anything else? I actually got a bit confused with the excitement that finally the ran happened after 24 hours.

rob-p commented 9 years ago

Hi @vd4mmind,

First, I'm sorry your ran into these issues. Obviously, the point of a pre-compiled binary is to make things easier to use, but we've discovered just how difficult that can be when dynamically linked libraries interact with different versions of the linux kernel and glibc. Also, thanks for reminding me about the change in the library format. We altered it to make it consistent with how the library is specified in salmon (which also gets rid of the need to do strange escaping of the library format string on the command line).

From what you posted, it looks like your run was successful. The only thing that I would note (I've mentioned this numerous times including on our FAQ, but it's worth repeating) is that one must be careful when attempting to interpret the "mapping rate" or "estimated number of reads" when comparing them to alignment-based approaches. These numbers are calculated based on the number of exactly matching k-mers, which means that you should expect to see much more modest numbers when compared to aligned reads (i.e. this is a highly conservative estimate of the potential alignment rate). Since salmon considers entire sequenced fragments rather than individual k-mers, it doesn't have this issue, but a more realistic estimate of what the actual alignment rate might be is one of the remaining features for sailfish that will probably be implemented soon. That said, this estimation issue only applies to the "Estimated number of reads" and mapping rate, and doesn't effect the base relative abundance metrics like TPM and RPKM.

vd4mmind commented 9 years ago

@rob-p Thank you very much for the clarity. Indeed I will exploit all the chores of the abundance, mapping rate and the estimated number of reads. I am looking for light weight method where we can work it out without alignment based methods and have more or less accurate estimation of transcriptome. Then we want to asses its downstream worth. I am using both Salmon and Sailfish and trying to see which I can set for my lab as a much more establish method for transcriptome quantification.

Speaking of the libraries, indeed salmon did not encounter any problem for individual or batch run effects, however I could not do that for Sailfish, I had to run all the samples interactively with Sailfish individually, probably this issue is due to the different versions of linux and glibc. I am not using Salmon with alignment method so it should not made much of a difference, having said that Salmon quantification will be much more accurate. Thanks for the detailed write. Cheers to the new era of Transcriptomics!!

rob-p commented 9 years ago

Sure --- your use case sounds ideal for both of the tools really! I just want to stress that, while Salmon, being a successor to Sailfish, is a general improvement in many ways (and I do believe that it will generally give more accurate estimates, especially when complex mixtures of isoforms are involved), this isn't meant to imply that the Sailfish results will be inaccurate. Consider, for example, this paper that was just published --- in includes Sailfish but not yet Salmon --- where, over a broad range of different tests, Sailfish seems to provide comparable accuracy to the large array of alignment-based methods tested. I'm going to close this ticket for now, but please just let us know if you have any more question!

vd4mmind commented 9 years ago

Sure, please close the ticket, actually I was going through this paper this afternoon , thanks for all the inputs

rob-p commented 9 years ago

FYI; sailfish has received a substantial update as of version v0.7.0. The issue related to the estimated number of reads no longer applies, and you can expect it to be an accurate estimate of the true number of fragments originating from a transcript.