seqcode / Bichrom

Interpretable bimodal network for transcription factor binding site prediction
MIT License

Data unavailable #9

Closed · CQ981001 closed this issue 1 year ago

CQ981001 commented 1 year ago

Hi,

Directly running run_bichrom.py with the sample data in the custom_data_files folder results in the following error:

python run_bichrom.py -training_schema_yaml ../custom_data_files/bichrom.yaml -len 500 -nbins 10 -outdir ../out/

Traceback (most recent call last):
  File "/home/changq/git/test/Bichrom/trainNN/run_bichrom.py", line 27, in <module>
    train_bichrom(data_paths=data_paths, outdir=outdir, 
  File "/home/changq/git/test/Bichrom/trainNN/train.py", line 98, in train_bichrom
    mseq_path = run_seq_network(train_path=data_paths['train'], val_path=data_paths['val'],
  File "/home/changq/git/test/Bichrom/trainNN/train.py", line 52, in run_seq_network
    loss, seq_val_pr = build_and_train_net(curr_params, train_path, val_path,
  File "/home/changq/git/test/Bichrom/trainNN/train_seq.py", line 129, in build_and_train_net
    loss, val_pr = train(model, train_path=train_path, val_path=val_path,
  File "/home/changq/git/test/Bichrom/trainNN/train_seq.py", line 103, in train
    train_dataset = TFdataset(train_path, batch_size, "seqonly")
  File "/home/changq/git/test/Bichrom/trainNN/train_seq.py", line 21, in TFdataset
    TFdataset_batched = iterutils.train_TFRecord_dataset(path, batchsize, dataflag)
  File "/home/changq/git/test/Bichrom/trainNN/iterutils.py", line 157, in train_TFRecord_dataset
    files = tf.data.Dataset.from_tensors(dspath["TFRecord"])
KeyError: 'TFRecord'

I would like to know whether there is something wrong with the data you provided.

yztxwd commented 1 year ago

Hello,

The custom_data_files folder contains example data files from previous versions of Bichrom; we have since switched to the TFRecord format, which is why you got the error.

The actual sample data files are in the sample_data folder, and you can generate your own training set for testing based on them, although you still need to provide your own bigWig file and genome FASTA file because of GitHub's repository size limit. For example:

python construct_data/construct_data.py \
  -info sample_data/mm10.info \
  -fa ~/group/lab/jianyu/genome/mm10/mm10.fa \
  -len 500 \
  -acc_domains sample_data/mES_atacseq_domains.bed \
  -chromtracks ~/group/lab/jianyu/datasets/MTRG_GSE120376_Histone_Modifications/output/coverage/ES_H3K27ac-group1.bamCompare.bw \
  -peaks sample_data/Ascl1.peaks \
  -o test -nbins 10 -p 4 \
  -blacklist sample_data/mm10_blacklist.bed

However, I have uploaded a very simple training set to custom_data_files so you can check whether your environment is set up correctly. Uncompress the tar.gz file, then train Bichrom from the project root directory using:

python trainNN/run_bichrom.py -training_schema_yaml custom_data_files/bichrom.yaml -len 500 -outdir test_step2 -nbins 10
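
For reference, the uncompress step might look roughly like this (the archive name below is a placeholder; use the actual tar.gz file in custom_data_files):

ARCHIVE=custom_data_files/example.tar.gz   # placeholder name, adjust to the uploaded archive
tar -xzf "$ARCHIVE" -C custom_data_files/
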
CQ981001 commented 1 year ago

Thank you very much for the new data you provided, but there was an error at runtime:

Epoch 1/15
Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
Please make sure libcudnn_cnn_infer.so.8 is in your library path!

I tried a few fixes found online, but nothing worked. I don't know if you've ever run into this problem.

yztxwd commented 1 year ago

It seems to me you don't have the cuDNN library, which should be handled by conda. Were you running the script on a machine without a GPU? Could you list the installed packages in your conda environment?
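For example, something like this should show whether the CUDA/cuDNN/TensorFlow packages are present (exact package names can vary depending on how the environment was created):

conda list | grep -iE "cuda|cudnn|tensorflow"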

CQ981001 commented 1 year ago

Thank you very much for your suggestion. It may be due to a mismatch in the CUDA version; I will follow up after installing the matching version. However, I also ran into problems in Step 1 (Construct Bichrom Input Data). The dataset [SRR520335] was downloaded from GSE39237, and the bigWig file I used was obtained by processing SRR520335; I'm not sure whether this dataset caused the error. The genome FASTA file I used was mm10. The command was:

python construct_data/construct_data.py \
  -info sample_data/mm10.info \
  -fa ~/data/mm10/fa/mm10.fa \
  -len 500 \
  -acc_domains sample_data/mES_atacseq_domains.bed \
  -chromtracks ~/data/SRR/alignment/SRR520335.bw \
  -peaks sample_data/Ascl1.peaks \
  -o test -nbins 10 -p 4 \
  -blacklist sample_data/mm10_blacklist.bed

The error message is as follows:

DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
DEBUG:h5py._conv:Creating converter from 7 to 5
DEBUG:h5py._conv:Creating converter from 5 to 7
Creating output directory
/home/changq/git/Bichrom/test
Recording output paths
['SRR520335']
Constructing train data ...
DEBUG:root:Bound samples in total: 92120
DEBUG:root: Bound samples in accessible region: 46818
DEBUG:root: Bound samples NOT in accessible region: 45302
DEBUG:root:# Unbound flank sample: 105844
DEBUG:root:# Unbound accessible sample: 46818
DEBUG:root:# Unbound inaccessible sample: 45302
INFO:root:Constructing training set for sequence network
INFO:root:It should satisfy two requirements: 1. Positive and Negative sample size should equal 2. Ratio of accessible region intersection should be balanced
DEBUG:root:training coordinates negative samples in accessible regions: 46818
DEBUG:root:training coordinates negative samples in inaccessible regions: 45302
DEBUG:root:label  type     
0      neg_acc      32937
       neg_flank    43562
       neg_inacc    15621
1      pos_shift    92120
dtype: int64
DEBUG:root:label  type      
0      neg_genome    92120
1      pos_shift     92120
dtype: int64
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/changq/anaconda3/envs/bichrom/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/changq/anaconda3/envs/bichrom/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/changq/git/Bichrom/construct_data/utils.py", line 321, in get_data_TFRecord_worker
    seq = genome_pyfasta[item.chrom][int(item.start):int(item.end)]
  File "/home/changq/anaconda3/envs/bichrom/lib/python3.9/site-packages/pyfasta/fasta.py", line 128, in __getitem__
    c = self.index[i]
KeyError: 'chr2'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/changq/git/Bichrom/construct_data/construct_data.py", line 307, in <module>
    main()
  File "/home/changq/git/Bichrom/construct_data/construct_data.py", line 252, in main
    TFRecords_train_seq, TFRecords_train_bichrom = construct_training_set(genome_sizes_file=args.info, genome_fasta_file=args.fa,
  File "/home/changq/git/Bichrom/construct_data/construct_data.py", line 152, in construct_training_set
    TFRecord_file_seq_f = utils.get_data_TFRecord(train_coords_seq, genome_fasta_file, chromatin_track_list, 
  File "/home/changq/git/Bichrom/construct_data/utils.py", line 295, in get_data_TFRecord
    res = res.get()
  File "/home/changq/anaconda3/envs/bichrom/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'chr2'

I think the error was caused by the dataset I used. If so, I wonder if you could provide a sample bigWig file. Also, if I use my own bigWig file, can I still use the -peaks and -acc_domains files you provided? Finally, since I am a beginner in bioinformatics, I am very sorry if these questions are a bother.

CQ981001 commented 1 year ago

Can you also provide the mm10.fa file? I would like to verify if the error is caused by a GPU configuration issue.

yztxwd commented 1 year ago

The error message is basically telling you that your genome FASTA file doesn't have chromosome "chr2". You can download the mm10 genome FASTA from UCSC: https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/
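
For example, roughly (the mm10.fa.gz file name follows the standard UCSC naming):

wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
gunzip mm10.fa.gz
# check that the expected chromosome names (chr1, chr2, ...) are present
grep ">" mm10.fa | head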

CQ981001 commented 1 year ago

Thank you very much for your advice. After changing the CUDA and cuDNN versions and the mm10 reference data, the script now runs through successfully. One last question: if I use my own bigWig file, can I use the -peaks and -acc_domains files you provided? They appear to be generated by MultiGPS, which is not available for download at this link: https://mahonylab.org/software/multigps/

yztxwd commented 1 year ago

That depends on what biological question you are asking. If you only want to run the scripts, you can use the peak and acc_domain files from the repo. The MultiGPS format is simply chromosome:coordinate, so you can use another peak caller such as MACS2 to generate BED output and then convert it into MultiGPS format. MultiGPS can be downloaded from https://github.com/seqcode/multigps/releases
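
As a rough sketch of that conversion (assuming a standard 3+ column BED file and taking the peak midpoint as the coordinate; the file names are placeholders):

# convert BED peaks to chromosome:coordinate, using the interval midpoint
awk '{printf "%s:%d\n", $1, int(($2+$3)/2)}' peaks.bed > peaks.multigps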

yztxwd commented 1 year ago

The issue should be resolved now; reopen it if needed.