pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
634 stars 168 forks source link

bustools capture (0.39.3) tutorial error #224

Open hmassalha opened 4 years ago

hmassalha commented 4 years ago

Hi, I am running over the tutorial of 'RNA velocity with kallisto | bus and velocyto.R'.... and I have two lines that I can't run: bustools capture -o cDNA_capture/ -c ./cDNA_tx_to_capture.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus bustools capture -o introns_capture/ -c ./introns_tx_to_capture.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

I am getting the following error message for both commands:

$ bustools capture -o introns_capture/ -c ./introns_tx_to_capture.txt -e matrix.ec -t transcri pts.txt output.correct.sort.bus Error: capture list type must be specified (one of -s, -u, or -b) Usage: bustools text [options] bus-files

Options: -o, --output File for text output -p, --pipe Write to standard output

seems like a typo or missing flag. I tried several with no success, this is why I am reporting this here.

Thanks, HM

lambdamoses commented 4 years ago

Sorry for the confusion. The notebook was written with bustools 0.39.2. There are some changes in bustools 0.39.3 that caused this problem. As bustools capture now also supports barcodes and UMIs, you need to tell it that it's transcripts that should be captured with the flag -s. The following command worked:

bustools capture -s -o cDNA_capture.bus -c ./cDNA_tx_to_capture.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

Note that in version 0.39.2, the output argument should be a directory, to which bustools capture writes multiple files. In contrast, in version 0.39.3, the output argument is a file, and bustools capture writes one bus file.

I noticed that in the RNA velocity R notebook, the new bustools resulted into fewer total counts and fewer detected genes than the older version, though the biologically relevant results don't seem to be significantly affected. I don't know why.

The RNA velocity notebook has been updated for the new version of bustools. The URL has not changed.

hmassalha commented 4 years ago

Thanks a lot. Soon I will check the suggested command. I tried your second suggestion, but was not successful as I am not getting the relevant files in the folders for the following commands.

I read about the capturing from UMIs, what that means? Dose it allow the barcodes to be concatenated with UMIs? Not sure I understand the exact purpose of using UMIs.. Could you explain with two sentences please?

Thanks, HM

Sent from my iPhone

On 22 Aug 2019, at 2:03, Lambda Moses notifications@github.com wrote:

Sorry for the confusion. The notebook was written with bustools 0.39.2. There are some changes in bustools 0.39.3 that caused this problem. As bustools capture now also supports barcodes and UMIs, you need to tell it that it's transcripts that should be captured with the flag -s. The following command worked:

bustools capture -s -o cDNA_capture.bus -c ./cDNA_tx_to_capture.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

lambdamoses commented 4 years ago

Sorry, only my last suggestion is correct. The previous two did not work as expected, so I deleted them. The output has changed too. The bustools capture command no longer outputs the captured and split files the older version did. Now the argument to -o needs to be a file name, where a single bus file will be written, and if it's a directory, then you will not get the outputs.

I'm not sure what capturing UMI is used for either, since I'm not involved in developing bustools itself, though I wrote the R notebooks. I guess it may be used to filter the bus file based on a list of UMIs that you want, just like how in transcript capture, you only keep the reads for the list of transcripts you want. I think @laureneliu implemented the new bustools capture features and she's in a better position to answer your question.

hmassalha commented 4 years ago

Thanks a lot, The bustools capture worked for me and I see the newly generated files. The bustools count is also worked for me but the output is not written in the spliced and unspliced folders as supposed to be. Might be you are missing a / at the -o flag? Once you write the files to the folders they will appear without the prefix (spliced/unspliced) which will prevent the following R code to work!

HM

lambdamoses commented 4 years ago

It's not written in the spliced and unspliced directories. Instead, it's directly written to the working directory, though the directory called "spliced" and "unspliced" are created automatically by bustools anyway, though those directories remain empty. The R code has been modified to reflect this change. If you want to put the files in the spliced and unspliced directories, you should put something like spliced/s for -o; then the output will be in the spliced directory, in which another empty directory s is created. Using bustools count with -o spliced/ will not write any output.

hmassalha commented 4 years ago

Ok, thanks just fix it. Thought you can correct the command mkdir spliced/ unspliced/ in the notebook if there is no need for the two directories.

I am adding here another question that I posted in the google groups, hope you can help me with this (or if you prefer I can open a new issue on GitHub) I just wasted a lot of time on velocity without and progress and probably Kallisto is what I need. https://groups.google.com/forum/#!topic/kallisto-and-applications/9AGqBlef-Gg

Thanks in advance, HM

lambdamoses commented 4 years ago

Oops, forgot to remove that line. I haven't worked with MARS-seq data before, but I have processed a 10x dataset with multiple samples. I processed the samples separately with kallisto and bustools to get separate matrices and then used Seurat's data integration to correct for batch effect and combine the matrices; which sample each cell is from was added to the metadata. As for the plates, if you don't want to process them separately, I suppose you can concatenate the plate barcode and the cell barcode and pretend that the concatenated sequence is cell barcode, so barcodes from different plates can be distinguished. I don't think the bus format has a field for plate barcode by itself.

hmassalha commented 4 years ago

Thanks a lot for your time and your help.

I am working with human samples, for now, I ran the pipeline with one sample. I have one fastq file with plate barcode concatenated. But I deleted in order to make the bus command to run smoothly, thus I got 384 cells out of ~2000 (meaning the bus command look for the unique barcodes! which is not good!)

I will rerun the pipeline with separated fastq files (produced by the bcl2fastq function). In this case, each file (plate) is 384 cells (barcodes). Hopefully, the bus tool will concatenate all plates together as they are coming from one sample.

What I am still missing is how to save the sample ID. It is important at the end of the day to tell the contribution of each human (sample) to the final output that I get. I tried the integrated function of Seurat, worked well for me! The question is how you tell velocity.R the origin of each cell (from which sample came?)

Thanks again, HM

lambdamoses commented 4 years ago

In Seurat data integration, you start with a list of Seurat objects, each object for one sample. Add the sample ID to a metadata column to the corresponding Seurat object and the sample ID will still be kept in the integrated object to tell which cell comes from which sample. When creating each object in the list of Seurat objects, you can add the sample ID like this: CreateSeuratObject(mat, project = "sample1"), so "sample1" will be added to the orig.ident column of metadata. Or you can add the sample ID later by something like seu$sample_id <- "sample1" before integrating data.

hmassalha commented 4 years ago

I did what you are suggesting in a separate analysis. But I still didn't get to the end of the tutorial to generate the velocity maps. Know I see that you use the Seurat object as an input for velocity. I will try it and will let you know what I have. Thanks a lot for your help. HM

lambdamoses commented 4 years ago

What's the problem? You need the SeuratWrapper package to directly use Seurat object for velocity.

hmassalha commented 4 years ago

The Seurat object that I have built after running zUMIs with STAR. I think I am missing the spanning exon-intron reads (this is what zUMIs people told me!) I tried running the velocyto using the BAM file generated by the zUMIs (recent version). But get very low representation for both spliced and unspliced reads. (might be because of how velocyto is treating the reads. This was also mentioned in your notebook:

  1. https://github.com/velocyto-team/velocyto.py/issues/148 2.here are more unspliced counts than spliced counts, which has been observed in multiple datasets. In contrast, for velocyto, the unspliced count is usually between 10% and 20% of the sum of spliced and unspliced. Perhaps this is because kallisto | bus counts reads that are partially intronic and partially exonic as unspliced while velocyto throws away many reads (see this GitHub issue).

This is why I am staring everything from the beginning using kallisto.

hmassalha commented 4 years ago

Hi, I have a problem running the velocyto.R on Rstudio server version 3.5.1 I am getting the following message:

`/usr/bin/ld: cannot find -lboost_filesystem /usr/bin/ld: cannot find -lboost_system collect2: error: ld returned 1 exit status make: *** [/apps/RH7U2/gnu/R/3.5.1/lib64/R/share/make/shlib.mk:6: velocyto.R.so] Error 1 ERROR: compilation failed for package ‘velocyto.R’

Hopefully, you have some experience with this type of error. I looked on the internet what others asked and suggested, but without any success!

Thanks in advance, HM

lambdamoses commented 4 years ago

Have you installed the boost C++ library? Also, you need to make sure that the compiled part of boost is installed, not just the headers. filesystem and system belong to the compiled part. Boost might have already been installed on your server. You can use locate libboost_system.so to find existing installation locations. If boost is not already installed, see https://www.boost.org/doc/libs/1_71_0/more/getting_started/unix-variants.html for installation instructions, or if you are an admin, use something like sudo apt-get install libboost-dev. Also, make sure that R knows where your boost installation is, by adding the directory containing the libboost_*.so files to LD_LIBRARY_PATH in .Renviron.

hmassalha commented 4 years ago

Hi, Thanks for your help. I am getting frustrated with installing this! I am not a computer scientist. I am sure I have a problem with the linkers and the installed version, I made a BIG mess! I have three boost modules from the IT unit (the output of module load): boost/1.55.0 boost/1.62.0 boost/1.69.0 which one is the best for our needs?

when I tried to run locate libboost_system.so command I get /usr/lib64/libboost_system.so.1.53.0 (without loading any of the above modules!

Could you please hint me what to do next? I am really confused! Thanks in advance, HM

lambdamoses commented 4 years ago

I don't think velocyto.R requires a minimum version of boost. It looks like you have boost 1.53.0 installed in /usr/lib64. Somehow R doesn't recognize it. On my server, I did successfully install velocyto.R without adding anything to .Renviron, nor do I have a ~/.R/Makevars file; the default settings should be able to find the boost installation in /usr/lib64. Can you post your entire installation log for velocyto.R? Also, can you find the files libboost_system.so and libboost_filesystem.so, without the 1.53.0?

hmassalha commented 4 years ago

Hi,

Here is the log for the installation of velocyto.R:

devtools::install_github("velocyto-team/velocyto.R") Downloading GitHub repo velocyto-team/velocyto.R@master Your system is ready to build packages! Skipping 1 packages not available: pcaMethods ✔ checking for file ‘/tmp/RtmpzfIUQn/remotes62b07495ee77/velocyto-team-velocyto.R-d779034/DESCRIPTION’ ... ─ preparing ‘velocyto.R’: ✔ checking DESCRIPTION meta-information ... ─ cleaning src ─ checking for LF line-endings in source and make files and shell scripts ─ checking for empty or unneeded directories ─ building ‘velocyto.R_0.6.tar.gz’

Installing package into ‘/home/labs/shalev/hassanm/R/x86_64-pc-linux-gnu-library/3.5’ (as ‘lib’ is unspecified)

  • installing source package ‘velocyto.R’ ... ** libs g++ -std=gnu++11 -I"/apps/RH7U2/gnu/R/3.5.1/lib64/R/include" -DNDEBUG -I"/apps/RH7U2/gnu/R/3.5.1/lib64/R/library/Rcpp/include" -I"/apps/RH7U2/gnu/R/3.5.1/lib64/R/library/RcppArmadillo/include" -I/usr/local/include -fopenmp -fpic -g -O2 -c RcppExports.cpp -o RcppExports.o g++ -std=gnu++11 -I"/apps/RH7U2/gnu/R/3.5.1/lib64/R/include" -DNDEBUG -I"/apps/RH7U2/gnu/R/3.5.1/lib64/R/library/Rcpp/include" -I"/apps/RH7U2/gnu/R/3.5.1/lib64/R/library/RcppArmadillo/include" -I/usr/local/include -fopenmp -fpic -g -O2 -c points_within.cpp -o points_within.o g++ -std=gnu++11 -I"/apps/RH7U2/gnu/R/3.5.1/lib64/R/include" -DNDEBUG -I"/apps/RH7U2/gnu/R/3.5.1/lib64/R/library/Rcpp/include" -I"/apps/RH7U2/gnu/R/3.5.1/lib64/R/library/RcppArmadillo/include" -I/usr/local/include -fopenmp -fpic -g -O2 -c routines.cpp -o routines.o g++ -std=gnu++11 -shared -L/apps/RH7U2/gnu/R/3.5.1/lib64/R/lib -L/usr/local/lib64 -o velocyto.R.so RcppExports.o points_within.o routines.o -lboost_filesystem -lboost_system -lstdc++ -L/apps/RH7U2/gnu/R/3.5.1/lib64/R/lib -lRlapack -L/apps/RH7U2/gnu/R/3.5.1/lib64/R/lib -lRblas -fopenmp -lgfortran -lm -lquadmath -L/apps/RH7U2/gnu/R/3.5.1/lib64/R/lib -lR /usr/bin/ld: cannot find -lboost_filesystem /usr/bin/ld: cannot find -lboost_system collect2: error: ld returned 1 exit status make: *** [/apps/RH7U2/gnu/R/3.5.1/lib64/R/share/make/shlib.mk:6: velocyto.R.so] Error 1 ERROR: compilation failed for package ‘velocyto.R’
  • removing ‘/home/labs/shalev/hassanm/R/x86_64-pc-linux-gnu-library/3.5/velocyto.R’ Error: Failed to install 'velocyto.R' from GitHub: (converted from warning) installation of package ‘/tmp/RtmpzfIUQn/file62b03e82188a/velocyto.R_0.6.tar.gz’ had non-zero exit status`

I didn't understand the last request

Also, can you find the files libboost_system.so and libboost_filesystem.so, without the 1.53.0?

I tried the following: if I run locate libboost_system.so I get /usr/lib64/libboost_system.so.1.53.0 if I run locate libboost_filesystem.so I get /usr/lib64/libboost_filesystem.so.1.53.0

Thanks in advance, HM

lambdamoses commented 4 years ago

To be honest, I don't know what really is going on either. I wonder if it's because you can't have the 1.53.0. I kind of wonder if installing a new version of boost in your home directory (including the system and filesystem compiled parts) and tell R were it is by adding where the new libboost_system.so and libboosst_filesystem.so are to LD_LIBRARY_PATH in .Renviron would help. Or with less work, create symbolic links to /usr/lib64/libboost_system.so.1.53.0 and /usr/lib64/libboost_filesystem.so.1.53.0 in your home directory, with file names stopping at the "so", no version numbers, and add the directory containing the symbolic links to LD_LIBRARY_PATH. Honestly, those are just my guesses and I don't know if they will work.

hmassalha commented 4 years ago

Thanks for your help. Really I just wasted a lot of time trying to make this working! I will try also to save the seurat object and move to work locally (window OS) rather than straggling with server commands and permissions.

Thanks a again, HM

Sent from my iPhone

On 24 Aug 2019, at 10:20, Lambda Moses notifications@github.com wrote:

To be honest, I don't know what really is going on either. I wonder if it's because you can't have the 1.53.0. I kind of wonder if installing a new version of boost in your home directory (including the system and filesystem compiled parts) and tell R were it is by adding where the new libboost_system.so and libboosst_filesystem.so are to LD_LIBRARY_PATH in .Renviron would help. Or with less work, create symbolic links to /usr/lib64/libboost_system.so.1.53.0 and /usr/lib64/libboost_filesystem.so.1.53.0 in your home directory, with file names stopping at the "so", no version numbers, and add the directory containing the symbolic links to LD_LIBRARY_PATH. Honestly, those are just my guesses and I don't know if they will work.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

lambdamoses commented 4 years ago

The bustools capture command of the previous version of the notebook has been changed to make results more consistent with velocyto. The command is now

bustools capture -s -x -o spliced.bus -c ./introns_tx_to_capture.txt -e matrix.ec -t transcripts.txt output.correct.sort.bus

See the notebook for explanation.

hmassalha commented 4 years ago

Hi, Thanks for your help. I managed to solve the problem. I am getting another type of problem with the RunVelocitycommand. I getting the following message:

> seu <- RunVelocity(seu, ncores = 1, reduction = "pca", verbose = TRUE)
Filtering genes in the spliced matrix
Filtering genes in the unspliced matrix
Calculating embedding distance matrix
calculating cell knn ... done
calculating convolved matrices ... done
fitting gamma coefficients ... Error in lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,  : 
  0 (non-NA) cases

I found that the error is that I have NAs matrices. Do you have any suggesstion why is that? and how I can overcome this error?

Thanks, HM

hmassalha commented 4 years ago

I have another suggestion for treating multiple samples. Do you think it is a good idea to attach artificial barcodes (2 letters) to the cell barcodes at the fastq files level for each sample, and then run the analysis for all cells at once (of course the white list will be updated accordingly). What do you think? Thanks, HM

lambdamoses commented 4 years ago

I think the artificial barcodes should work, as long as it distinguishes between barcodes in different samples. I don't think there's supposed to be NA in the matrix. Are the NA's in the raw count matrix or the SCTransform normalized matrix? Also, are there cells with too few genes detected in your data?

hmassalha commented 4 years ago

Hi, I don't have any NA in my data...

> which(is.na(as.matrix(seu@assays$spliced@counts)))
integer(0)
> which(is.na(as.matrix(seu@assays$spliced@data)))
integer(0)
> which(is.na(as.matrix(seu@assays$spliced@scale.data)))
integer(0)
> which(is.na(as.matrix(seu@assays$unspliced@counts)))
integer(0)
> which(is.na(as.matrix(seu@assays$unspliced@data)))
integer(0)
> which(is.na(as.matrix(seu@assays$unspliced@scale.data)))

here is the number of counts of the spliced and the unspliced matrices.. sorted sum counts - unspliced sorted sum counts - spliced

do you have any suggestion why I am getting the following error:

> seu <- RunVelocity(seu, ncores = 1, reduction = "pca", verbose = TRUE)
Filtering genes in the spliced matrix
Filtering genes in the unspliced matrix
Calculating embedding distance matrix
calculating cell knn ... done
calculating convolved matrices ... done
fitting gamma coefficients ... Error in lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok,  : 
  0 (non-NA) cases

Thanks, HM

lambdamoses commented 4 years ago

Your QC doesn't look all too unusual. This is the code that caused the problem: https://github.com/velocyto-team/velocyto.R/blob/d7790346cb99f49ab9c2b23ba70dcf9d2c9fc350/R/momentum_routines.R#L305 I wonder if something weird happened to the convolved matrix.

hmassalha commented 4 years ago

Hi, Thanks again for your help. I managed to run the pipeline to the end. My previous problem caused by low-quality cells. After applying more stringent preprocessing steps I managed to run it to the end. Thanks again for your help. So I have 6 samples, I prepared the 6 BUS files for each. The files were merged after loading them to R, but before making the Seurat object.

For part of the cell types, I got biologically incorrect velocity vectors. My first question is what one can do to test the validity of the results? or, where in the pipeline I can modify things/optimise some inputs to make the pipeline better represent my data? (i.e. collapse/separate isoforms...)

The second question is related to tSNE embeddings. I have my data in a Seurat project (SeuObj) that was prepared from exon reads using zUMIs and Seurat pipeline and I am happy with it. If I understand correctly, the vectors of the velocity have to be related to tSNE x y coordinates. If I bring the tSNE coordinates from the SeuObj that I have already, do I have to recalculate the velocity vectors? I mean vector is the state for a cell in a given x y coordinates. bringing x y coordinates from another Seurat object will not represent the exact state of a cell because it has new neighbors in this case. What is the best way to copy data from an existing Seurat object for velocity needs?

I hope I explained my self. Thanks in advance, HM

lambdamoses commented 4 years ago

About biologically incorrect velocity vectors: That's my fault. In a previous version of BUSpaRse, if you choose to extract the transcriptome from the genome based on annotation, then the cDNA capture list might contain transcripts absent from the transcriptome but somehow are present in the annotation. This caused bustools capture to give wrong results for the unspliced matrix, without giving a helpful error message. In the most recent version of BUSpaRse, this problem has been fixed so all transcripts in the cDNA capture list will be in the transcriptome, and this did quite significantly change the velocity arrows when I rebuilt that notebook.

Also, in a previous version of BUSpaRse, there was a bug in the way the flanked intronic sequences are extracted: Exons of size between L-1 and 2*(L-1) will be included in the flanked intronic sequences, causing overestimation of unspilced counts. This has been fixed in the most recent version of BUSpaRse. When I rebuilt that notebook after this update, the velocity results somewhat changed, though not very dramatically. Now BUSpaRse is on Bioconductor 3.10 (the development version). Also note that my RNA velocity tutorial has been updated. The bustools capture command in the tutorial has also been updated to be more consistent with velocyto.

Yes, the vector projection based on correlation does depend on the specific x y coordinates of the dimension reduction, so if you have a different non-linear dimension reduction, you do need to recalculate the projected arrows.

lambdamoses commented 4 years ago

I implemented the option for whether to collapse isoform when I wasn't sure whether it would be better to do so. I thought more carefully about it and now I think it might be better to keep isoforms separate; collapsing isoforms might inflate spliced counts. So I updated the RNA velocity notebook and added an explanation of why I decided to keep the isoforms separate. I think for the RNA velocity results to be valid, it should at least be consistent with existing biological knowledge, such as cell type annotation and known cell lineages. But there might still be surprises; RNA velocity has help discovering new cell populations in a few studies.

hmassalha commented 4 years ago

Thanks for your input, I think in my case the separate isoforms are not optimal. I will test again with the collapsed setting. After updating the BUSpaRse I am getting more intronic reads, hoping this will improve the quantifications. Do you think the Ensembl annotation version will affect the velocity map dramatically? Thanks, HM

lambdamoses commented 4 years ago

Qualitatively, I don't think different Ensembl versions will cause dramatic differences, but probably you need to see for yourself for your dataset.

hmassalha commented 4 years ago

Hi, I didn't find in the pipeline where you filter for genes. Are you using all genes as an input for estimating the velocity vectors? or you let the RunVelocity function to decide which genes to use. I see in other methods they care about which genes feed to velocity estimation http://pklab.med.harvard.edu/velocyto/notebooks/R/SCG71.nb.html.

Any suggestion? Thanks, HM

lambdamoses commented 4 years ago

RunVelocity filters genes behind the scene the same way as the velocyto vignette you referred to, using the spliced.average and unspliced.average arguments. See the source code: https://github.com/satijalab/seurat-wrappers/blob/1e814d16664b73a25b1430a3928641e91f6f8759/R/velocity.R#L201