hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Error while running segway train : "can't tie one track in multiple groups" and "Set of training windows is empty" #148

Open Cheryn-A opened 4 years ago

Cheryn-A commented 4 years ago

Hello, I am having trouble running Segway. Here is my situation: I wish to use Segway with 6 ChIP-seq tracks. I started from raw data in fastq format for each signal and its associated input (s1.fastq, s1_input.fastq, ..., s6.fastq, s6_input.fastq).

In https://www.biorxiv.org/content/10.1101/080382v1.full you mention that "The preferred input for Segway is the “fold change over control” bigWig signal file, because it is already processed and normalized." So here is the full pipeline I followed:

1-Pre-processing each fastq file:

2-Generating bw files with deepTools-3.0.2:

bamCompare --bamfile1 s1_trimmed_sort_rmDup.bam --bamfile2 s1_input_trimmed_sort_rmDup.bam --binSize 10 --normalizeUsing RPKM --effectiveGenomeSize 3099734149 --smoothLength 1 --operation ratio --scaleFactorsMethod None --scaleFactors 1:1 -o s1_norm.bw

(Output: s1_norm.bw ... s6_norm.bw)

3-Converting bigwig into bedGraph

bigWigToBedGraph s1_norm.bw s1_norm.bedGraph

(Output: s1_norm.bedGraph ... s6_norm.bedGraph)

4-Generating genomedata files with genomedata-1.4.4:

genomedata-load-seq s1_norm.genomedata GRCh38.primary_assembly.genome.fa
genomedata-open-data s1_norm.genomedata -- tracknames s1_norm;
genomedata-load-data s1_norm.genomedata s1_norm < s1_norm.bedGraph;
genomedata-close-data s1_norm.genomedata

(Output: s1_norm.genomedata ... s6_norm.genomedata)

5-Running segway

mkdir s_traindir
segway train --resolution 10 --num-instance 10 --minibatch-fraction 0.01 --num-labels 18 s1.genomedata s2.genomedata s3.genomedata s4.genomedata s5.genomedata s6.genomedata s_traindir

This gave me the following error:

Traceback (most recent call last):
  File "/home/cheryn/anaconda3/bin/segway", line 10, in <module>
    sys.exit(main())
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 4265, in main
    return runner()
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 3841, in __call__
    self.run(*args, **kwargs)
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 3817, in run
    self.run_train()
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 3267, in run_train
    self.init_train()
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 3172, in init_train
    self.init_shared()
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 3161, in init_shared
    self.save_gmtk_input()
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 2247, in save_gmtk_input
    self.set_tracknames()
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 1734, in set_tracknames
    self.add_track_group([trackname])  # Adds to self.tracks
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 949, in add_track_group
    raise ValueError("can't tie one track in multiple groups")
ValueError: can't tie one track in multiple groups

So I tried to follow the example in https://segway.readthedocs.io/en/latest/quick.html#acquiring-data and to run it with fewer parameters and on a single genomedata file:

segway train s1.genomedata s_traindir

And I got this error:

Traceback (most recent call last):
  File "/home/cheryn/anaconda3/bin/segway", line 10, in <module>
    sys.exit(main())
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 4265, in main
    return runner()
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 3841, in __call__
    self.run(*args, **kwargs)
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 3817, in run
    self.run_train()
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 3267, in run_train
    self.init_train()
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 3172, in init_train
    self.init_shared()
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 3161, in init_shared
    self.save_gmtk_input()
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/run.py", line 2256, in save_gmtk_input
    observations.locate_windows(genomes)
  File "/home/cheryn/anaconda3/lib/python3.7/site-packages/segway/observations.py", line 949, in locate_windows
    raise ValueError("Set of training windows is empty")
ValueError: Set of training windows is empty

When I try to run it with the test.genomedata given here https://segway.readthedocs.io/en/latest/quick.html#acquiring-data it works well.

So I am a bit confused by the error messages and I do not know how to solve this.

Could you please help me? Did I do something wrong while generating my genomedata archives? Am I using the command line correctly to run Segway?

I thank you in advance for your answers.

PS: I am working on Ubuntu 18.04.5

EricR86 commented 3 years ago

@Cheryn-A hello, and sorry for the very late reply. I have notifications set up for this repository and, to my surprise, I did not get one for this issue.

It looks like you are having issues with Genomedata creation. Notably, at some point in your steps, data seems to go missing: the "Set of training windows is empty" error typically means there is no data to work with. The "can't tie one track in multiple groups" error, I believe, might indicate that you are accidentally using the same trackname in each of your genomedata archives.

With genomedata-open-data, the "-- tracknames" option should be "--trackname", with no space between the hyphens and the name, and the positional arguments should come after the options (though I don't really know if that makes a difference): genomedata-open-data --trackname s1_norm s1_norm.genomedata.

The rest of your commands look correct. I would also ensure that you choose a different trackname across your archives so Segway can uniquely determine your datasets.
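As a rough way to check for reused tracknames before training, the sketch below flags any name that appears in more than one archive. The helper names (find_shared_tracknames, read_tracknames) and the example archive contents are hypothetical and only for illustration. read_tracknames assumes the genomedata Python package is installed and reads names through its Genome.tracknames_continuous attribute.

```python
from collections import defaultdict


def find_shared_tracknames(archive_tracks):
    """Return {trackname: [archives]} for names found in more than one archive."""
    owners = defaultdict(list)
    for archive, names in archive_tracks.items():
        # set() so a name listed twice in one archive is counted once
        for name in set(names):
            owners[name].append(archive)
    return {name: paths for name, paths in owners.items() if len(paths) > 1}


def read_tracknames(archive_path):
    """Read track names from a real archive (assumes genomedata is installed)."""
    from genomedata import Genome  # lazy import: the check above works without it

    with Genome(archive_path) as genome:
        return list(genome.tracknames_continuous)


# Hypothetical archive contents, for illustration only
example = {
    "s1_norm.genomedata": ["s1_norm"],
    "s2_norm.genomedata": ["s1_norm"],  # same name reused by mistake
    "s3_norm.genomedata": ["s3_norm"],
}
print(find_shared_tracknames(example))
```

On real archives, the input mapping could be built with {path: read_tracknames(path) for path in archive_paths}. An empty result means every trackname is unique across the archives.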

Most of the commands also come with a verbosity option that I would highly recommend using when debugging these issues. For example, it would help verify how much data is being loaded.
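Independently of the verbosity option, it may help to confirm that each bedGraph actually contains usable signal before loading it. The following is a minimal sketch, not part of Segway or Genomedata: summarize_bedgraph is a hypothetical helper that counts covered bases per chromosome and skips non-finite values, and the sample lines are made up.

```python
import math
from collections import Counter


def summarize_bedgraph(lines):
    """Return (covered bases per chromosome, number of non-finite intervals)."""
    covered = Counter()
    skipped = 0
    for line in lines:
        fields = line.split()
        # Ignore blank lines, comments, and track/browser header lines
        if not fields or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        value = float(fields[3])
        if math.isfinite(value):
            covered[chrom] += end - start
        else:
            skipped += 1  # nan or inf carries no usable signal
    return dict(covered), skipped


# Made-up sample lines, for illustration only
sample = [
    "chr1\t0\t10\t1.5",
    "chr1\t10\t20\tnan",
    "chr2\t0\t30\t0.8",
]
covered, skipped = summarize_bedgraph(sample)
print(covered, skipped)
```

For a real file, pass the open file handle instead of the sample list, for example with open("s1_norm.bedGraph") as handle: summarize_bedgraph(handle). An empty dictionary, or chromosome names that differ from those in the FASTA file used with genomedata-load-seq, would be worth investigating.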

For general troubleshooting of Segway issues, I would highly recommend e-mailing the mailing list, segway-l@listserv.utoronto.ca, to potentially reach a larger audience.

varsha090597 commented 3 years ago

Hello, I am facing the same issue: "ValueError: Set of training windows is empty". I took into account what you mentioned about genomedata file creation in your response, but that still does not help. Any help with this would be appreciated. Thanks.

EricR86 commented 3 years ago

@varsha090597 could you please post your segway train command?

Although it is hard to tell, the most likely situation is that there is actually no data in your genomedata archives. Perhaps try creating them again with the --verbose option to verify data is being loaded in.
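Before re-creating archives, it may also be worth confirming that the paths given to segway train are the archives that were actually built. In the original report, step 4 writes s1_norm.genomedata while step 5 passes s1.genomedata. The sketch below uses a hypothetical helper, missing_archives, that simply lists the paths that do not exist. Whether a missing path leads to either of the errors above is not confirmed in this thread.

```python
from pathlib import Path


def missing_archives(paths):
    """Return the archive paths that do not exist on disk (file or directory)."""
    return [path for path in paths if not Path(path).exists()]


# Paths as passed to "segway train" in the original report
train_args = [f"s{i}.genomedata" for i in range(1, 7)]
print(missing_archives(train_args))
```

An empty list means every archive path resolves; any path printed here should be corrected before running segway train.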