mcgilldinglab / MATES

A Deep Learning-Based Model for Quantifying Transposable Elements in Single-Cell Sequencing Data
MIT License

NameError: name 'exit' is not defined #5

Open Citugulia40 opened 5 months ago

Citugulia40 commented 5 months ago

Hi, thanks for developing this helpful tool.

I am running MATES on 10X data, and I am on the first step, which is:

bam_processor.split_bam_files('10X', 20, 'sample_list_file.txt', 'bam_path_file.txt', 'bc_path_file.txt')

but I am getting the error below. I installed it on Python 3.11. Which version would you recommend, and how can I solve this error?

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[10], line 1
----> 1 bam_processor.split_bam_files('10X', 200, 'sample_list_file.txt', 'bam_path_file.txt', 'bc_path_file.txt')

File /data2/ccitu/software/MATES/MATES/bam_processor.py:71, in split_bam_files(data_mode, threads_num, sample_list_file, bam_path_file, bc_ind, long_read, bc_path_file)
     69 if bc_path_file == None:
     70     print("Please provide barcodes file for 10X data!")
---> 71     exit(1)
     72 print("Start splitting multi sub-bam based on cell barcodes...")
     74 processes = []

NameError: name 'exit' is not defined

Szym29 commented 5 months ago

Hello,

This issue should be fixed now.

@RoKsaNne , can you please use raise ValueError('xxxx') instead of exit(1)? Please check this script.

Thanks.
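
A minimal sketch of the change suggested above, assuming a hypothetical helper `require_barcode_file` that is not part of MATES. The bare `exit` name is added to the builtins by Python's `site` module for interactive use and is not guaranteed to exist in every environment, as the traceback shows; raising `ValueError` behaves the same in scripts and notebooks.

```python
def require_barcode_file(data_mode, bc_path_file=None):
    """Hypothetical helper: validate the barcode argument for 10X data."""
    if data_mode == '10X' and bc_path_file is None:
        # Raise instead of calling exit(1) so the caller gets a normal exception.
        raise ValueError("Please provide barcodes file for 10X data!")
    return bc_path_file


try:
    require_barcode_file('10X')
except ValueError as err:
    message = str(err)
    print(message)
```

A caller can catch the exception or let it propagate; either way the notebook kernel keeps running.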

Citugulia40 commented 4 months ago

Hi,

Thanks for your reply.

Is the issue fixed now?

Szym29 commented 4 months ago

Hi,

@RoKsaNne is working on it. We will let you know once it's fixed.

Thanks.

RoKsaNne commented 3 months ago

Hi @Citugulia40 ,

Sorry for the late response. For the first step, the command should be: bam_processor.split_bam_files('10X', 20, 'sample_list_file.txt', 'bam_path_file.txt', bc_ind='CR', bc_path_file='bc_path_file.txt')

In this command, bc_ind is the tag that identifies the barcode field in your BAM file. Also, I recommend choosing a thread number <= the number of samples in this step, and possibly more in the following steps.
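
The parameter order in the earlier traceback is (data_mode, threads_num, sample_list_file, bam_path_file, bc_ind, long_read, bc_path_file), so a fifth positional argument binds to bc_ind rather than bc_path_file. A minimal sketch of that binding, assuming a stand-in function `split_bam_files_stub` with guessed default values (the real defaults are not shown in this thread) and a hypothetical `cap_threads` helper:

```python
import inspect


def split_bam_files_stub(data_mode, threads_num, sample_list_file, bam_path_file,
                         bc_ind=None, long_read=False, bc_path_file=None):
    """Stand-in only: parameter order from the traceback, defaults assumed."""


sig = inspect.signature(split_bam_files_stub)

# Original call: the fifth positional argument lands in bc_ind.
first = sig.bind('10X', 20, 'sample_list_file.txt', 'bam_path_file.txt',
                 'bc_path_file.txt')
first.apply_defaults()

# Corrected call: both barcode arguments are passed by keyword.
fixed = sig.bind('10X', 20, 'sample_list_file.txt', 'bam_path_file.txt',
                 bc_ind='CR', bc_path_file='bc_path_file.txt')
fixed.apply_defaults()

print(first.arguments['bc_ind'], first.arguments['bc_path_file'])
print(fixed.arguments['bc_ind'], fixed.arguments['bc_path_file'])


def cap_threads(requested, n_samples):
    """Hypothetical helper: keep the thread count at or below the sample count."""
    return max(1, min(requested, n_samples))
```

Passing the barcode arguments by keyword avoids depending on the position of long_read in the signature.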

We have updated a sample dataset along with a walkthrough pipeline in the example folder. Please check it out. I hope this is helpful.

Citugulia40 commented 3 months ago

Hi @RoKsaNne

Thanks for your response.

I successfully ran the first step but when I am training the model, I am encountering an issue.

MATES_model.train('10X', 'test_samplelist.txt', bin_size = 5, proportion = 80, BATCH_SIZE= 36, AE_LR = 1e-6, MLP_LR = 1e-6, AE_EPOCHS = 150, MLP_EPOCHS = 150, DEVICE= 'cuda:0')

_`CUDA is not available.
Data Mode:  10X
AE Settings:  Epoch:    150, Learning Rate: 0.000001
MLP Settings: Epoch:    150, Learning Rate: 0.000001
Batch Size:     36
Searching Bin Size:      5
Dominate Proportion:     80
Loading training data for test...
Training model for test...
  0%|          | 0/150 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[15], line 1
----> 1 MATES_model.train('10X', 'test_samplelist.txt', bin_size = 5, proportion = 80, BATCH_SIZE= 36, AE_LR = 1e-6, MLP_LR = 1e-6, AE_EPOCHS = 150, MLP_EPOCHS = 150, DEVICE= 'cpu')

File /data2/ccitu/software/MATES/MATES/MATES_model.py:16, in train(data_mode, sample_list_file, bin_size, proportion, BATCH_SIZE, AE_LR, MLP_LR, AE_EPOCHS, MLP_EPOCHS, DEVICE)
     14         sample_name = [line.rstrip('\n') for line in sample_file]
     15     for idx, sample in enumerate(sample_name):
---> 16         MATES_train(data_mode, sample, bin_size, proportion, BATCH_SIZE, AE_LR, MLP_LR, 
     17              AE_EPOCHS, MLP_EPOCHS, DEVICE)
     19 elif data_mode == 'Smart_seq':
     20     MATES_train(data_mode, sample_list_file, bin_size, proportion, BATCH_SIZE, AE_LR, MLP_LR, 
     21              AE_EPOCHS, MLP_EPOCHS, DEVICE)

File /data2/ccitu/software/MATES/MATES/scripts/train_model.py:383, in MATES_train(data_mode, file_name, bin_size, prop, BATCH_SIZE, AE_LR, MLP_LR, AE_EPOCHS, MLP_EPOCHS, DEVICE)
    380     MLP_meta_train=pickle.load(f)
    382 print("Training model for " + sample + '...')
--> 383 pretrain_AE(AE_EPOCHS, bin_size, prop, BATCH_SIZE, DEVICE, AE_LR,TE_FAM_NUMBER,
    384             TE_train, Batch_train, data_mode, sample)
    386 Meta_Data, hidden_info, Batch_Info, Region_Info = get_AE_embedding(data_mode, bin_size, prop, 
    387                                                                 BATCH_SIZE, DEVICE, AE_LR,TE_FAM_NUMBER,
    388                                                                 MLP_TE_train, MLP_Batch_train, 
    389                                                                 MLP_Region_train, MLP_meta_train, AE_EPOCHS, sample)
    392 MLP_trained_loader = get_MLP_input(BATCH_SIZE, Meta_Data, hidden_info, Batch_Info, Region_Info)

File /data2/ccitu/software/MATES/MATES/scripts/train_model.py:67, in pretrain_AE(EPOCHS, bin_size, prop, BATCH_SIZE, device, AE_LR, TE_FAM_NUMBER, TE_train, Batch_train, data_mode, sample)
     65 Batch_info  = Batch_ids.clone().detach().view(BATCH_SIZE,1)
     66 BATCH_data.data.copy_(Batch_info)
---> 67 hidden, reconstruct = AENet(TE_data*1e6, BATCH_data, BATCH_SIZE)
     68 loss = loss_f(reconstruct, TE_data*1e6)
     69 if epoch+1 == EPOCHS:

File ~/anaconda3/envs/mates_env/lib/python3.9/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /data2/ccitu/software/MATES/MATES/scripts/AutoEncoder.py:33, in AutoEncoder.forward(self, TE_data, BATCH_data, BATCH_SIZE)
     32 def forward(self, TE_data, BATCH_data, BATCH_SIZE, ):
---> 33     reshaped_TE=torch.reshape(TE_data, (BATCH_SIZE,2001)).to('cuda:0')
     34     ##one-hot encoding TE Fam info
     35     batch_id_encode = torch.eye(self.n_fam)[BATCH_data.type(torch.LongTensor)].view(BATCH_SIZE, self.n_fam).to('cuda:0')

File ~/anaconda3/envs/mates_env/lib/python3.9/site-packages/torch/cuda/__init__.py:247, in _lazy_init()
    245 if 'CUDA_MODULE_LOADING' not in os.environ:
    246     os.environ['CUDA_MODULE_LOADING'] = 'LAZY'
--> 247 torch._C._cuda_init()
    248 # Some of the queued calls may reentrantly call _lazy_init();
    249 # we need to just return without initializing in that case.
    250 # However, we must not let any *other* threads in!
    251 _tls.is_initializing = True

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx`_

We don't have a GPU server, and when I switch the device to 'cpu', I get the same error.

Can MATES only be run on GPU servers?

Thanks

RoKsaNne commented 3 months ago

Hi, thanks for pointing that out. I updated the code and it can now run on the CPU.

You can change the code above as follows:

from MATES import MATES_model

MATES_model.train('10X', 'test_samplelist.txt', bin_size = 5, proportion = 80, BATCH_SIZE= 36, AE_LR = 1e-6, MLP_LR = 1e-6, AE_EPOCHS = 150, MLP_EPOCHS = 150, DEVICE= 'cpu')

MATES_model.prediction('exclusive', '10X', 'test_samplelist.txt', bin_size = 5, proportion = 80, AE_trained_epochs =150, MLP_trained_epochs=150, DEVICE= 'cpu', ref_path = 'Default')

MATES_model.prediction_locus('exclusive', '10X', 'test_samplelist.txt', bin_size=5, proportion=80, AE_trained_epochs=150, MLP_trained_epochs=150, DEVICE= 'cpu', ref_path = 'Default')

to run MATES on the CPU.

However, based on my attempts, it takes a few hours to finish training the model for the example data on the CPU, while it only takes a few seconds on the GPU.
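
The earlier traceback shows AutoEncoder.forward moving tensors with .to('cuda:0') regardless of the DEVICE argument, which is why passing 'cpu' still reached CUDA initialisation. A minimal sketch of the usual pattern, assuming a hypothetical helper `resolve_device` (this is not the actual MATES fix): choose the device once and use that value wherever a tensor is moved.

```python
def resolve_device(requested, cuda_available):
    """Hypothetical helper: fall back to 'cpu' when CUDA is requested but absent."""
    if requested.startswith('cuda') and not cuda_available:
        return 'cpu'
    return requested


try:
    import torch
    cuda_available = torch.cuda.is_available()
except ImportError:
    # PyTorch is not installed here; treat CUDA as unavailable.
    cuda_available = False

device = resolve_device('cuda:0', cuda_available)
print(device)

# Inside a model, tensors would then be moved with `.to(device)`
# rather than a hard-coded `.to('cuda:0')`.
```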

Citugulia40 commented 3 months ago

Thanks for your reply.

Yes, training now runs with:

MATES_model.train('10X', 'test_samplelist.txt', bin_size = 5, proportion = 80, BATCH_SIZE= 36, AE_LR = 1e-6, MLP_LR = 1e-6, AE_EPOCHS = 150, MLP_EPOCHS = 150, DEVICE= 'cpu')

But in the prediction steps, I get the same error:

MATES_model.prediction('exclusive', '10X', 'test_samplelist.txt', bin_size = 5, proportion = 80, AE_trained_epochs =150, MLP_trained_epochs=150, DEVICE= 'cpu', ref_path = 'Default')

MATES_model.prediction_locus('exclusive', '10X', 'test_samplelist.txt', bin_size=5, proportion=80, AE_trained_epochs=150, MLP_trained_epochs=150, DEVICE= 'cpu', ref_path = 'Default')

Also, can we run MATES on CellRanger BAM files?

bam_processor.split_bam_files('10X', 20, 'sample_list_file.txt', 'bam_path_file.txt', bc_ind='CR', bc_path_file='bc_path_file.txt')

For CellRanger, we should specify 'CB' in 'bc_ind' instead of 'CR', is that right?

Thanks in advance

Best

RoKsaNne commented 3 months ago

Hi @Citugulia40 ,

It should work now. Part of the code had not been pushed to GitHub; sorry about that.

And yes, you can run MATES on CellRanger output using 'CB', provided the BAM file has the 'NH' field. However, according to the documentation on the CellRanger website:

Older versions of cellranger (such as v3.1.0) output all the alignments of multi-mapping reads to the BAM file. Newer versions of cellranger retain just one record of a multi-mapping read.

We recommend using STARsolo to keep the multimapping reads for quantification.
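
Whether a BAM file carries the NH and barcode tags can be checked by viewing a few records as SAM text (for example with samtools view) and reading the optional fields. A minimal sketch, assuming a hypothetical helper `sam_tags` and a shortened, made-up record; per the SAM specification, optional fields start at column 12 and have the form TAG:TYPE:VALUE.

```python
def sam_tags(line):
    """Hypothetical helper: return the optional tags of one SAM record as a dict."""
    fields = line.rstrip('\n').split('\t')
    tags = {}
    for item in fields[11:]:  # the first 11 columns are mandatory fields
        name, type_code, value = item.split(':', 2)
        tags[name] = int(value) if type_code == 'i' else value
    return tags


# Shortened, made-up record: 11 mandatory columns followed by three tags.
record = '\t'.join([
    'read1', '0', 'chr1', '11291', '0', '4M', '*', '0', '0', 'ACGT', 'EEEE',
    'NH:i:8', 'HI:i:1', 'CB:Z:TCATTTGGTCGGGTCT',
])

tags = sam_tags(record)
has_nh = 'NH' in tags
is_multi_mapped = has_nh and tags['NH'] > 1  # NH is the number of reported alignments
print(tags, has_nh, is_multi_mapped)
```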

Citugulia40 commented 3 months ago

Thank you so much for your reply.

The pipeline is running fine now on the test dataset.

However, I am getting an error when running on the CellRanger BAM file:

bam_processor.split_bam_files('10X', 20, 'test_samplelist.txt', 'test_bam_path.txt',bc_ind = 'CB', bc_path_file='test_cb_path.txt')

Error

Directory ./file_tmp created.
Directory ./bam_tmp created.
Directory ./bc_tmp created.
Start splitting bam files into unique/multi reads sub-bam files ...
Directory ./unique_read created.
Directory ./multi_read created.
Finish splitting bam files into unique reads and multi reads sub-bam files.
Start splitting multi sub-bam based on cell barcodes...
[E::hts_open_format] Failed to open file "./unique_read/test/by_barcode/*.bam" : No such file or directory
samtools sort: can't open "./unique_read/test/by_barcode/*.bam": No such file or directory
[E::hts_open_format] Failed to open file "./unique_read/test/by_barcode/*.bam" : No such file or directory
samtools index: failed to open "./unique_read/test/by_barcode/*.bam": No such file or directory
Finish splitting unique sub-bam.
Finish splitting multi sub-bam.
Directory ./file_tmp removed.
Directory ./bam_tmp removed.
Directory ./bc_tmp removed.
[E::hts_open_format] Failed to open file "./multi_read/test/by_barcode/*.bam" : No such file or directory
samtools sort: can't open "./multi_read/test/by_barcode/*.bam": No such file or directory
[E::hts_open_format] Failed to open file "./multi_read/test/by_barcode/*.bam" : No such file or directory
samtools index: failed to open "./multi_read/test/by_barcode/*.bam": No such file or directory

I can try to use STARsolo for creating the BAM files.

Thanks again.
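
The unexpanded *.bam paths in the log above suggest that the by_barcode directories contained no BAM files. One possible cause, not confirmed in this thread, is that the barcodes in the barcode file do not match the CB values in the BAM exactly; Cell Ranger, for instance, appends a suffix such as -1 to its CB tags. A minimal sketch of an overlap check, assuming hypothetical helper names and illustrative barcode values:

```python
def barcode_overlap(bam_barcodes, listed_barcodes):
    """Hypothetical helper: count listed barcodes that also occur in the BAM."""
    bam_set = set(bam_barcodes)
    listed_set = set(listed_barcodes)
    return len(bam_set & listed_set), len(listed_set)


def strip_suffix(barcode):
    """Drop a trailing '-<number>' style suffix, if present."""
    return barcode.split('-', 1)[0]


# Illustrative values only: CB tags with a suffix, barcode file without one.
bam_barcodes = ['AAACCTGAGTAACCCT-1', 'AAACCTGGTAGCGTAG-1']
listed = ['AAACCTGAGTAACCCT', 'AAACCTGGTAGCGTAG', 'AAACCTGGTAGTAGTA']

raw_overlap = barcode_overlap(bam_barcodes, listed)
stripped_overlap = barcode_overlap([strip_suffix(b) for b in bam_barcodes], listed)
print(raw_overlap, stripped_overlap)
```

An overlap of zero before stripping and a non-zero overlap afterwards would point to a formatting mismatch rather than missing cells.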

Citugulia40 commented 3 months ago

Hi,

Sorry for bothering you again

I have processed my data using STARsolo and this is the structure of my BAM file

SRR10278808.14031871    73  chr1    11291   0   30S86M  *   0   0   AAGCAGTGGTATCAACGCAGAGTACATGGGTGCTGGCGCCGGGGCACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGT    AAAAAEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEAEEEEEEEEEEEE<EEEEE    CR:Z:TCATTTGGTCGGGTCT   UR:Z:TTAAATTAGT CY:Z:AAAAAEEEEEEEEEEE   UY:Z:EEEEEEEEEE NH:i:8  HI:i:1  CB:Z:TCATTTGGTCGGGTCT   UB:Z:TTAAATTAGT
SRR10278809.2147705 73  chr1    11291   0   28S88M  *   0   0   GCAGTGGTATCAACGCAGAGTACATGGGTGCTGGCGCCGGGGCACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCC    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE/EEEEEE<EEE<EAEEEAEEEE    CR:Z:TCATTTGGTCGGGTCT   UR:Z:TTAAATTAGT CY:Z:AAAAAEEEEEEEEEEE   UY:Z:EEEEEEEEEE NH:i:8  HI:i:1  CB:Z:TCATTTGGTCGGGTCT   UB:Z:TTAAATTAGT

and this is my barcode file

AAACCTGAGTAACCCT
AAACCTGGTAGCGTAG
AAACCTGGTAGTAGTA
AAACCTGTCAGGCCCA
AAACGGGCAGGGTTAG
AAACGGGGTCCGAATT
AAACGGGGTTCGAATC
AAACGGGTCAAGAAGT
AAACGGGTCAGTTCGA
AAACGGGTCTAACTTC

I am running MATES on 8 samples and all the samples are processed using

STAR --soloType CB_UMI_Simple --soloCBwhitelist 737K-august-2016.txt --soloMultiMappers EM --runThreadN 64 --genomeDir /data2/ccitu/software/STAR/human_index --outFileNamePrefix AD1_starsolo --readFilesIn AD1_R1.fastq AD1_R3.fastq AD1_R2.fastq --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outSAMattributes CR UR CY UY CB UB NH HI

All the steps run fine, but the last step fails:

TE_quantifier.quantify_locus_TE_MTX('exclusive', '10X', 'dataset2_samplelist.txt')

Error:

Finish finalizing Unique TE MTX for AD3
Finish finalizing Unique TE MTX for AD5
Finish finalizing Unique TE MTX for AD7
Finish finalizing Unique TE MTX for AD1
Finish finalizing Unique TE MTX for Ct1
Finish finalizing Unique TE MTX for Ct3
Finish finalizing Unique TE MTX for Ct5
Finish finalizing Unique TE MTX for Ct7
Finalizing locus expression matrix for AD3...
/home/ccitu/anaconda3/envs/mates_env/lib/python3.9/site-packages/anndata/_core/aligned_df.py:67: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
/home/ccitu/anaconda3/envs/mates_env/lib/python3.9/site-packages/anndata/_core/aligned_df.py:67: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
/home/ccitu/anaconda3/envs/mates_env/lib/python3.9/site-packages/anndata/_core/aligned_df.py:67: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
/home/ccitu/anaconda3/envs/mates_env/lib/python3.9/site-packages/anndata/_core/aligned_df.py:67: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
Finis finalizing locus expression matrix for AD3.
Finalizing locus expression matrix for AD5...
/home/ccitu/anaconda3/envs/mates_env/lib/python3.9/site-packages/anndata/_core/aligned_df.py:67: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
/home/ccitu/anaconda3/envs/mates_env/lib/python3.9/site-packages/anndata/_core/aligned_df.py:67: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
/home/ccitu/anaconda3/envs/mates_env/lib/python3.9/site-packages/anndata/_core/aligned_df.py:67: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
/home/ccitu/anaconda3/envs/mates_env/lib/python3.9/site-packages/anndata/_core/aligned_df.py:67: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[40], line 1
----> 1 TE_quantifier.quantify_locus_TE_MTX('exclusive', '10X', 'dataset2_samplelist.txt')

File ~/anaconda3/envs/mates_env/lib/python3.9/site-packages/MATES/TE_quantifier.py:141, in quantify_locus_TE_MTX(TE_mode, data_mode, sample_list_file)
    139 # Add the values for the same features
    140 common_vars = adata_multi.var_names.intersection(adata_unique.var_names)
--> 141 adata_multi[:, common_vars].X += adata_unique[:, common_vars].X
    143 # Concatenate AnnData objects along the features axis (axis=1)
    144 combined_adata = ad.concat([adata_multi, adata_unique[:, adata_unique.var_names.difference(common_vars)]], axis=1)

File ~/anaconda3/envs/mates_env/lib/python3.9/site-packages/scipy/sparse/_base.py:466, in _spbase.__add__(self, other)
    464 elif issparse(other):
    465     if other.shape != self.shape:
--> 466         raise ValueError("inconsistent shapes")
    467     return self._add_sparse(other)
    468 elif isdense(other):

ValueError: inconsistent shapes

Additionally, for all my samples, the number of lines in each file within the result_MTX folder is as follows:

 2 Multi_TE_MTX.csv
 3 TE_MTX.csv
 2 Unique_TE_MTX.csv

Is there anything wrong with my files?

Please help

Thanks in advance.

Best
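
The "inconsistent shapes" error in the traceback above is raised by SciPy when two sparse matrices of different shapes are added, so the two slices being summed did not have the same dimensions, for example if the unique and multi matrices cover different sets of cells. A minimal sketch of label-aligned addition, using pandas DataFrames as a stand-in for the AnnData objects and made-up values; this is not the MATES implementation.

```python
import pandas as pd

# Made-up cell-by-TE count tables with different cells and features.
multi = pd.DataFrame([[1, 2], [3, 4]],
                     index=['cell1', 'cell2'], columns=['TE_a', 'TE_b'])
unique = pd.DataFrame([[10, 20]],
                      index=['cell1'], columns=['TE_b', 'TE_c'])

# DataFrame.add aligns on row and column labels; fill_value=0 treats an entry
# missing from one table as zero, and fillna(0) covers entries missing from both.
combined = multi.add(unique, fill_value=0).fillna(0)
print(combined)
```

Aligning on labels sidesteps the shape check, but very small result files, such as the two- and three-line CSVs listed above, would still be worth inspecting before combining.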