velocyto-team / velocyto.py

RNA velocity estimation in Python
http://velocyto.org/velocyto.py/
BSD 2-Clause "Simplified" License
160 stars 83 forks source link

Running velocyto on bulk PE RNA-seq? #130

Open snaqvi1990 opened 6 years ago

snaqvi1990 commented 6 years ago

Hello,

This is more of a question than an issue, but I would like to run velocyto on bulk, 100x100 RNA-seq. I have 6 samples currently, all with >30m reads. Is there any particular way you would suggest that this could be done? I tried run_smartseq2 on the samples, one .bam file at a time, but it crashed...

Thanks, Sahin

scfurl commented 6 years ago

I am attempting this as well... I tried the following command:

velocyto run -U "file.bam" "file.gtf"

I get the error as follows which is weird because I am using the -U flag:

raise IOError("The bam file does not contain cell and umi barcodes appropriatelly formatted. If you are runnin UMI-less data you should use the -U flag.") OSError: The bam file does not contain cell and umi barcodes appropriatelly formatted. If you are runnin UMI-less data you should use the -U flag.

Any help would be appreciated

scfurl commented 6 years ago

Follow up... I followed the advice of snower2010 from this post https://github.com/velocyto-team/velocyto.py/issues/107 and used simplesam https://simplesam.readthedocs.io/en/latest/#indices-and-tables to look at my bam file output from star (after sorting). I noticed that the bam files (of course) did not have a 'CB' or 'UB'. I used simplesam to add dummy CB and UB and now velocyto software appears to run without error. Hope this helps someone else

scfurl commented 6 years ago

from simplesam import Reader, Writer in_file = open('in.bam', 'r') in_sam = Reader(in_file) x = next(in_sam) x.tags

with Reader(open('in.bam')) as in_bam: with Writer(open('out.sam', 'w'), in_bam.header) as out_sam: for read in in_bam:

print(read.qname)

        read[umi_tag] = read.qname.split(":")[2] # add the umi tag
        read[barcode_tag] = read.qname.split(":")[1] # add the barcode tag
        out_sam.write(read)
z5ouyang commented 5 years ago

Following on this, do you get the arrows on each sample like Figure 1h? I tried several ways to alter the scRNAseq pipeline without success (not an arrow on each sample), this is what I have tried (any suggestion is welcomed):

  1. Loading the combined loom files and plot the fraction (which works) vlm = vcy.VelocytoLoom(strDT) vlm.normalize("S", size=True, log=True) fractionPlot = vlm.plot_fractions() fractionPlot.savefig(strPath+"fraction.pdf",bbox_inches="tight")
  2. Setting the clusters, normalization and PCA vlm.set_clusters(cluster_labels=...,cluster_colors_dict=...) vlm._normalize_S(relative_size=vlm.S.sum(0),target_size=vlm.S.sum(0).mean()) vlm._normalize_U(relative_size=vlm.U.sum(0),target_size=vlm.U.sum(0).mean()) vlm.perform_PCA() plt.plot(np.cumsum(vlm.pca.explained_varianceratio)[:20]) n_comps = np.where(np.diff(np.diff(np.cumsum(vlm.pca.explained_varianceratio))>0.006))[0][0] plt.axvline(n_comps, c="k"); plt.savefig(strPath+"fit.pdf",bbox_inches="tight") plt.close()
  3. KNN, gamma fit and velocity (I think this is tricky part, maybe I made misunderstanding here) vlm.knn_imputation(k=1,n_pca_dims=n_comps) vlm.fit_gammas() vlm.predict_U() vlm.calculate_velocity() vlm.calculate_shift(assumption="constant_velocity") vlm.extrapolate_cell_at_t(delta_t=1.)
  4. estimate transition (got warning for estimate_transition_prob: WARNING:root:Nans encountered in corrcoef and corrected to 1s. If not identical cells were present it is probably a small isolated cluster converging after imputation) vlm.pc = vlm.pcs[:,range(0,2)] vlm.estimate_transition_prob(hidim="Sx_sz", embed="pc", transform="sqrt", psc=1, n_neighbors=2, sampled_fraction=1,threads=4) vlm.calculate_embedding_shift(sigma_corr = 0.1, expression_scaling=True) vlm.calculate_grid_arrows(smooth=0.5, n_neighbors=1,steps=(5, 5))
  5. plot (which only a few arrows, 21 are total samples) vlm.plot_arrows_embedding(choice=21,scatter_kwargs={"alpha":1, "lw":0.01})
xiebb123456 commented 5 years ago

@z5ouyang I've also redraw the figure 1h, but the sample and arrow are not form a circle. Do you have any progress?

gioelelm commented 5 years ago

I think you should do a feature selection using genes that seem to vary across the time course. Then stop at step 3 of the analysis @z5ouyang is doing and then plot the pca of vlm.Sx_sz and draw arrows that point from vlm.Sx_sz to vlm.Sx_sz + k * self.delta_S, where k is a scaling constant to allow good visualization. Also check the fit of gammas, in this case with extremely few points is working well, you might want to use trivial regression with intercept because the quantile fit would basically pass a line through 2 points.

z5ouyang commented 5 years ago

I think at the end, I followed this one: http://pklab.med.harvard.edu/velocyto/notebooks/R/chromaffin2.nb.html

biostat0903 commented 4 years ago

Hi velocyto team, I followed the replies above, but velocyto reports an error in my file. I use three steps as following:

  1. sort the HG00096.1.M_111124_6.bam file
    samtools sort HG00096.1.M_111124_6.bam > sort_HG00096.1.M_111124_6.bam
  2. umi modification
    import simplesam
    barcode_tag = 'CB'
    umi_tag = 'UB'
    with simplesam.Reader(open("sort_HG00096.1.M_111124_6.bam")) as in_bam:
    with simplesam.Writer(open("umi_sort_HG00096.1.M_111124_6.sam", 'w'), in_bam.header) as out_sam:
        for read in in_bam:
            read[umi_tag] = read.qname.split(":")[2] # add the umi tag
            read[barcode_tag] = read.qname.split(":")[1] # add the barcode tag
            out_sam.write(read)
    # sam file to bam file
    samtools view -S -b umi_sort_HG00096.1.M_111124_6.sam > umi_sort_HG00096.1.M_111124_6.bam
  3. velocyto
    gtffile=gencode.v19.chr_patch_hapl_scaff.annotation.gtf
    bamfile=umi_sort_HG00096.1.M_111124_6.bam
    velocyto run ${bamfile} ${gtffile} -U -c

    The error is

    
    if unique and read.get_tag("NH") != 1:
    File "pysam/libcalignedsegment.pyx", line 2395, in pysam.libcalignedsegment.AlignedSegment.get_tag
    File "pysam/libcalignedsegment.pyx", line 2437, in pysam.libcalignedsegment.AlignedSegment.get_tag
    KeyError: "tag 'NH' not present"

Could you tell me how to deal with the error? Many thanks
Best, 
Sheng
biostat0903 commented 4 years ago

Hi, I solve the error by adding NH information, following #198 . But there still some errors. I give the details in #237. Thanks! Best, Sheng