velocyto-team / velocyto.py

RNA velocity estimation in Python
http://velocyto.org/velocyto.py/
BSD 2-Clause "Simplified" License

no BGZF EOF marker; file may be truncated #106

Open Linda-Lan opened 6 years ago

Linda-Lan commented 6 years ago

Hi velocyto team,

I ran velocyto on the 10x Cell Ranger count output with the following command, but the .loom file that is supposed to be generated was not produced.

velocyto run10x /project/wilsonp/linda/319-5_prime /project/wilsonp/linda/refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf

It shows this error:

[lindalan@midway-login2 sbatch]$ vim velocyto.slurm.e49102211

/home/lindalan/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "/home/lindalan/anaconda3/bin/velocyto", line 11, in <module>
    sys.exit(cli())
  File "/home/lindalan/anaconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/lindalan/anaconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/lindalan/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/lindalan/anaconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/lindalan/anaconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/lindalan/anaconda3/lib/python3.6/site-packages/velocyto/commands/run10x.py", line 106, in run10x
    samtools_memory=samtools_memory, dump=dump, verbose=verbose, additional_ca=additional_ca)
  File "/home/lindalan/anaconda3/lib/python3.6/site-packages/velocyto/commands/_run.py", line 229, in _run
    results = exincounter.count(bamfile_cellsorted, multimap=multimap) # NOTE: we would avoid some millions of if statements evalution if we write two function count and count_with output
  File "/home/lindalan/anaconda3/lib/python3.6/site-packages/velocyto/counter.py", line 754, in count
    for r in self.iter_alignments(bamfile, unique=not multimap):
  File "/home/lindalan/anaconda3/lib/python3.6/site-packages/velocyto/counter.py", line 249, in iter_alignments
    fin = pysam.AlignmentFile(bamfile) # type: pysam.AlignmentFile
  File "pysam/libcalignmentfile.pyx", line 734, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 944, in pysam.libcalignmentfile.AlignmentFile._open
  File "pysam/libchtslib.pyx", line 366, in pysam.libchtslib.HTSFile.check_truncation
OSError: no BGZF EOF marker; file may be truncated

Here are the files generated so far:

[lindalan@midway-login2 319-5_prime]$ cd outs
[lindalan@midway-login2 outs]$ ls
analysis
cellsorted_possorted_genome_bam.bam
cellsorted_possorted_genome_bam.bam.tmp.0000.bam
[... cellsorted_possorted_genome_bam.bam.tmp.0001.bam through cellsorted_possorted_genome_bam.bam.tmp.0062.bam, numbered consecutively ...]
cellsorted_possorted_genome_bam.bam.tmp.0063.bam
cloupe.cloupe
filtered_gene_bc_matrices
filtered_gene_bc_matrices_h5.h5
metrics_summary.csv
molecule_info.h5
possorted_genome_bam.bam
possorted_genome_bam.bam.bai
raw_gene_bc_matrices
raw_gene_bc_matrices_h5.h5
web_summary.html

Do you have any solutions? Thank you!

gioelelm commented 6 years ago

This doesn't look like a problem in velocyto but rather a problem of its dependencies pysam/samtools. It could also be that the file is corrupted or something like that. Does 'samtools view BAMFILE' work? Are you using conda?
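Besides running 'samtools view BAMFILE' (or 'samtools quickcheck BAMFILE'), the truncation that pysam complains about can be checked directly. The following is only a minimal sketch: it compares the last 28 bytes of a file with the empty BGZF block that the SAM/BAM specification defines as the end-of-file marker. The function name has_bgzf_eof is made up for this illustration and is not part of velocyto, pysam or samtools.

```python
from pathlib import Path

# The 28-byte empty BGZF block that the SAM/BAM specification uses as EOF marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff0600" "42430200"
    "1b000300" "00000000" "00000000"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF EOF marker block.

    Hypothetical helper for illustration; it only inspects the file tail and
    says nothing about whether the rest of the BAM file is intact.
    """
    path = Path(path)
    if path.stat().st_size < len(BGZF_EOF):
        return False  # too small to contain the marker at all
    with path.open("rb") as handle:
        handle.seek(-len(BGZF_EOF), 2)  # 2 = offset relative to end of file
        return handle.read() == BGZF_EOF


# Example usage (the path is taken from the listing earlier in this thread):
#   has_bgzf_eof("outs/cellsorted_possorted_genome_bam.bam")
```

A False result would be consistent with an interrupted sort or copy, which is one way to end up with the "file may be truncated" error; a True result does not prove the file is otherwise valid.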

Linda-Lan commented 6 years ago

Hi gioelelm,

I then realized that I needed to delete the old BAM files, since velocyto seems to skip the sorting step if old BAM files exist. I re-ran it and it successfully generated the .loom file. I then ran the analysis following the Python tutorial, and it shows the error below. What should I put for ClusterName? Also, do you have a Docker image on DockerHub? Is it possible to have an analysis pipeline for RStudio?

[lindalan@midway-login1 velocyto]$ python analysis.py
/home/lindalan/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "analysis.py", line 14, in <module>
    vlm.set_clusters(vlm.ca["ClusterName"])
KeyError: 'ClusterName'

gioelelm commented 6 years ago

Good to hear that it worked, sorry if I didn't point out that the problem could be a previously corrupted sorted bam file.
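For anyone hitting the same situation, here is a rough sketch of how the leftover cell-sorted files could be located before rerunning. It assumes the file naming seen in the listing earlier in this thread (a cellsorted_ prefix on the sorted BAM and on its .tmp.NNNN.bam chunks); the helper name find_stale_sorted_bams is hypothetical. It only lists candidates and does not touch possorted_genome_bam.bam, the Cell Ranger output.

```python
from pathlib import Path


def find_stale_sorted_bams(outs_dir, prefix="cellsorted_"):
    """List cell-sorted BAM files and temporary chunks under outs_dir.

    Hypothetical helper: the "cellsorted_" prefix is an assumption taken from
    the directory listing in this thread, not from velocyto's documentation.
    """
    outs = Path(outs_dir)
    # Matches e.g. cellsorted_possorted_genome_bam.bam and
    # cellsorted_possorted_genome_bam.bam.tmp.0036.bam, but not
    # possorted_genome_bam.bam itself.
    return sorted(p for p in outs.glob(f"{prefix}*.bam") if p.is_file())


# Example usage: review the list first, then delete by hand or with unlink().
#   for path in find_stale_sorted_bams("319-5_prime/outs"):
#       print(path)
#       # path.unlink()
```

Printing before deleting is deliberate: the listing makes it easy to confirm that only the intermediate sorted files are selected.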

You can pass any vector of labels or cluster name. That line of code assumes that you performed some clustering step and you stored the result in that column attribute, in the loom file. But you can substitute the column attribute with any numpy array, the info doesn't need to be stored in the loom file and can come from previous analyses.
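As an illustration of passing labels that come from a previous analysis, the sketch below builds a numpy array with one label per cell. The helper name labels_for_cells, the example cell IDs and the label names are all made up; the use of a "CellID" column attribute in the commented lines is an assumption about how the loom file was written.

```python
import numpy as np


def labels_for_cells(cell_ids, label_by_cell, default="unassigned"):
    """Build one cluster label per cell, in the same order as cell_ids.

    Hypothetical helper: cells missing from label_by_cell get the default label.
    """
    return np.array([label_by_cell.get(str(c), default) for c in cell_ids])


# Made-up cell IDs and labels standing in for an earlier clustering result.
cell_ids = np.array(["cellA", "cellB", "cellC"])
previous_clustering = {"cellA": "neuron", "cellC": "glia"}
labels = labels_for_cells(cell_ids, previous_clustering)

# With a loaded VelocytoLoom object, the same idea would look like this
# (assuming the loom file has a "CellID" column attribute):
#   labels = labels_for_cells(vlm.ca["CellID"], previous_clustering)
#   vlm.set_clusters(labels)
```

The array needs one entry per cell currently held in the object, so it is safest to build it after any cell filtering.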

Linda-Lan commented 6 years ago

Thank you for your prompt reply. Is there anything in this script that may be unnecessary or cause an error?

import os
import numpy as np  # needed for np.percentile below
import matplotlib.pyplot as plt  # needed for plt.figure below
import velocyto as vcy
from sklearn.manifold import TSNE

# Load the loom file and normalize the spliced counts
vlm = vcy.VelocytoLoom("319-5_prime.loom")
vlm.normalize("S", size=True, log=True)
vlm.S_norm  # bare expression: has no effect when run as a script
vlm.plot_fractions()
vlm.dump_hdf5("my_velocyto_analysis")

# Filter cells and genes
vlm.filter_cells(bool_array=vlm.initial_Ucell_size > np.percentile(vlm.initial_Ucell_size, 0.5))
vlm.set_clusters(vlm.ca["ClusterName"])  # raises KeyError if the loom file has no "ClusterName" column attribute
vlm.score_detection_levels(min_expr_counts=40, min_cells_express=30)
vlm.filter_genes(by_detection_levels=True)
vlm.score_cv_vs_mean(3000, plot=True, max_expr_avg=35)
vlm.filter_genes(by_cv_vs_mean=True)

# Normalize spliced and unspliced counts by cell size
vlm._normalize_S(relative_size=vlm.S.sum(0), target_size=vlm.S.sum(0).mean())
vlm._normalize_U(relative_size=vlm.U.sum(0), target_size=vlm.U.sum(0).mean())

# PCA and kNN imputation
vlm.perform_PCA()
vlm.knn_imputation(n_pca_dims=20, k=500, balanced=True, b_sight=3000, b_maxl=1500, n_jobs=16)

vlm.fit_gammas()

vlm.plot_phase_portraits(["Igfbpl1", "Pdgfra"])

# Velocity estimation and extrapolation
vlm.predict_U()
vlm.calculate_velocity()
vlm.calculate_shift(assumption="constant_velocity")
vlm.extrapolate_cell_at_t(delta_t=1.)
vlm.calculate_shift(assumption="constant_unspliced", delta_t=10)
vlm.extrapolate_cell_at_t(delta_t=1.)

# Embedding and projection of the velocities onto it
bh_tsne = TSNE()
vlm.ts = bh_tsne.fit_transform(vlm.pcs[:, :25])
vlm.estimate_transition_prob(hidim="Sx_sz", embed="ts", transform="sqrt", psc=1, n_neighbors=3500, knn_random=True, sampled_fraction=0.5)
vlm.calculate_embedding_shift(sigma_corr=0.05, expression_scaling=True)

# Plot the velocity field on a grid
vlm.calculate_grid_arrows(smooth=0.8, steps=(40, 40), n_neighbors=300)
plt.figure(None, (20, 10))
vlm.plot_grid_arrows(quiver_scale=0.6, scatter_kwargs_dict={"alpha": 0.35, "lw": 0.35, "edgecolor": "0.4", "s": 38, "rasterized": True}, min_mass=24, angles='xy', scale_units='xy', headaxislength=2.75, headlength=5, headwidth=4.8, minlength=1.5, plot_random=True, scale_type="absolute")

gioelelm commented 6 years ago

Yes. Beyond the line we discussed, there is also the fact that the parameters assume a dataset of roughly the same size as the dentate gyrus one.
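One concrete consequence: neighbor counts such as k=500 or n_neighbors=3500 in the script above cannot be satisfied by a dataset with fewer cells than that. The sketch below is only a guard against that upper bound, not a tuning recommendation from the velocyto authors; the helper name cap_neighbors is hypothetical, and taking the number of cells from vlm.S.shape[1] assumes the matrices are stored as genes by cells.

```python
def cap_neighbors(requested, n_cells):
    """Cap a neighbor count at the number of other cells available.

    Hypothetical helper: it only prevents asking for more neighbors than
    exist; good values still have to be chosen for each dataset.
    """
    if n_cells < 2:
        raise ValueError("need at least two cells to define neighbors")
    return max(1, min(requested, n_cells - 1))


# Example usage with a loaded VelocytoLoom object (assumes genes x cells layout):
#   n_cells = vlm.S.shape[1]
#   k = cap_neighbors(500, n_cells)
#   n_neighbors = cap_neighbors(3500, n_cells)
```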

Counts-Xin commented 4 years ago

Sorry, so you mean I should delete the BAM file (possorted_genome_bam.bam)?

GouQiao commented 2 years ago

Hi guys, how did you solve the "no BGZF EOF marker; file may be truncated" error? I deleted the old files, but it didn't work.

Best