giesselmann / STRique

Nanopore raw signal repeat detection pipeline
MIT License
45 stars 10 forks source link

Empty STRique output #4

Closed MaestSi closed 5 years ago

MaestSi commented 5 years ago

Dear STRique developers, I tried out your software, but got empty output a part from the header. I created a config file with the sequences of interest and mapped my MinION reads to human reference genome hg38 with minimap2. This is the command I used for running STRique:

STRIQUE_ROOT=/home/simone/MinION/software/STRique
STRIQUE="python3 /home/simone/MinION/software/STRique/scripts/STRique.py"
REPEAT=/home/simone/MinION/Cas9/repeat_config.tsv

$STRIQUE index --recursive $FAST5 > $INDEX
samtools view $BAM | $STRIQUE count $INDEX $STRIQUE_ROOT/models/template_median68pA6mer.model $REPEAT --t 30 > STRique_output.txt

And this is the output I got:

ID target strand count score_prefix score_suffix log_p offset ticks

The target region I am interested in has coverage depth of about 13X. Is it enough? Is STRique able to differentiate between the two haplotypes? All tests I performed with python3 scripts/STRique_test.py gave PASS results. When looking in IGV at reads mapped to the target region I can roughly count the number of triplets. Do you have any suggestions? Thanks in advance

giesselmann commented 5 years ago

Hey @MaestSi, can you post the values of $INDEX and $FAST5? STRique is not complaining if it does not find fast5 files and the index file stores only relative paths. Pay

MaestSi commented 5 years ago

Sure, here they are:

FAST5=/home/simone/MinION/Cas9/20190328_1110_MDCas9 INDEX=/home/simone/MinION/Cas9/MD_Cas9.fofn

Simone

giesselmann commented 5 years ago

Okay I guess the issue is the index then: You ran the indexing for the folder

/home/simone/MinION/Cas9/20190328_1110_MDCas9

therefore you would need to store your index as e.g.

/home/simone/MinION/Cas9/20190328_1110_MDCas9/MD_Cas9.fofn

Alternatively you can use a (not yet documented) feature and run:

$STRIQUE index --out_prefix 20190328_1110_MDCas9 --recursive $FAST5 > $INDEX

To explain this: The index.fofn contains relative paths to the fast5 files, starting from the directory you provide in the indexing step. If not saved to the correct location, the paths are not valid and no reads are found resulting in no output. If this causes similar problems in the future, I'll implement some checking and warning message if the index seems to be corrupted.

Pay

MaestSi commented 5 years ago

I changed INDEX to:

INDEX=/home/simone/MinION/Cas9/20190328_1110_MDCas9/MDCas9.fofn

but still got no output. Should it be correct? As an example, the first line of INDEX file is:

fast5/20/Marta_PC_20190328_FAJ01316_MN20885_sequencing_run_MDCas9_85850_read_45042_ch_418_strand2.fast5 f99a3273-d83e-452d-8570-982b910b32a0

giesselmann commented 5 years ago

Looks okay to me.

ls home/simone/MinION/Cas9/20190328_1110_MDCas9/fast5/20/Marta_PC_20190328_FAJ01316_MN20885_sequencing_run_MDCas9_85850_read_45042_ch_418_strand2.fast5

should list you the fast5 file.

Could you please check, that alignment and repeat config are on the same coordinates (hg19/hg38,...). STRique is scanning the alingment cigar and only starting a detection, if alignment (+ soft-clipping) spans any of the configured target regions.

I ran the test/example commands from the README, giving me the expected output line. Saying if index, configuration and input alignment match, you should definitely see some output.

MaestSi commented 5 years ago

Yes, the above command lists the fast5 file. All genomic coordinates are hg38-based. I tried again giving slightly different prefix and suffix sequences in the config file, but still didn't succeed. May lower-case letters in the reference (and not in the repeat, prefix and suffix sequences listed in the config file) be an issue?

giesselmann commented 5 years ago

Lower-case in reference doesn't matter, the prefix/suffix sequences in the config should be upper case. You would also get an error otherwise. I set up a better logging and pushed a new version to the master branch, could you please update using:

cd STRique
git pull
python3 setup.py install

and re-run the pipeline as above but with --log_level debug

samtools view $BAM | $STRIQUE count $INDEX $STRIQUE_ROOT/models/template_median68pA6mer.model $REPEAT --t 30 --log_level debug > STRique_output.txt

and post the output here. This will also print target names, feel free to mask confidential information in there!

MaestSi commented 5 years ago

A different error occurred this time. This is a part of the error message printed to stdout.

09.04.2019 13:29:42 [PID 33196] [INFO] Logger created. 09.04.2019 13:29:42 [PID 33196] [DEBUG] Main: Parsed config. 09.04.2019 13:29:43 [PID 33235] [WARNING] Factory: Unexpected error in Worker, proceeding wiht remaining reads. Traceback (most recent call last):

File "/home/simone/MinION/software/STRique/scripts/STRique.py", line 682, in worker input = worker_callable(**input)

File "/home/simone/MinION/software/STRique/scripts/STRique.py", line 611, in detect self.__init_hmm__()

File "/home/simone/MinION/software/STRique/scripts/STRique.py", line 572, in __init_hmm__ self.repeatCounter.add_target(target_name, repeat, prefix, suffix)

File "/home/simone/MinION/software/STRique/scripts/STRique.py", line 499, in add_target flankedRepeatHMM(repeat, prefix, suffix, self.pm, self.HMM_config) )

File "/home/simone/MinION/software/STRique/scripts/STRique.py", line 397, in init self.__build_model__()

File "/home/simone/MinION/software/STRique/scripts/STRique.py", line 426, in __build_model__ self.bake(merge='All')

File "pomegranate/hmm.pyx", line 765, in pomegranate.hmm.HiddenMarkovModel.bake

TypeError: 'method' object is not iterable

The output file is empty a part from the header as before.

giesselmann commented 5 years ago

Okay, so building the model fails. Can you confirm, that not only the script STRique_test.py passes but also the following lines from the README?

python3 scripts/STRique.py index data/ > data/reads.fofn
cat data/c9orf72.sam | python3 scripts/STRique.py count ./data/reads.fofn ./models/template_median68pA6mer.model ./configs/repeat_config.tsv --config ./configs/STRique.json

If they give the expected output, you installation in general works and there must be something wrong with the repeat config file. Can you share the config, if you want send it per mail to giesselmann[at]molgen.mpg.de

MaestSi commented 5 years ago

The second command gives error:

[INFO] not in [<log_type.Error: '[ERROR]'>, <log_type.Warning: '[WARNING]'>] [DEBUG] not in [<log_type.Error: '[ERROR]'>, <log_type.Warning: '[WARNING]'>] ID target strand count score_prefix score_suffix log_p offset ticks [DEBUG] not in [<log_type.Error: '[ERROR]'>, <log_type.Warning: '[WARNING]'>] 09.04.2019 15:35:11 [PID 33810] [WARNING] Factory: Unexpected error in Worker, proceeding wiht remaining reads. Traceback (most recent call last):

File "scripts/STRique.py", line 682, in worker input = worker_callable(**input)

File "scripts/STRique.py", line 611, in detect self.__init_hmm__()

File "scripts/STRique.py", line 572, in __init_hmm__ self.repeatCounter.add_target(target_name, repeat, prefix, suffix)

File "scripts/STRique.py", line 499, in add_target flankedRepeatHMM(repeat, prefix, suffix, self.pm, self.HMM_config) )

File "scripts/STRique.py", line 397, in init self.__build_model__()

File "scripts/STRique.py", line 426, in __build_model__ self.bake(merge='All')

File "pomegranate/hmm.pyx", line 765, in pomegranate.hmm.HiddenMarkovModel.bake

TypeError: 'method' object is not iterable

[DEBUG] not in [<log_type.Error: '[ERROR]'>, <log_type.Warning: '[WARNING]'>] [DEBUG] not in [<log_type.Error: '[ERROR]'>, <log_type.Warning: '[WARNING]'>]

Sure, I am going to send you my config file. Thanks, Simone

giesselmann commented 5 years ago

No need to do so, it looks like pomegranate itself is causing the error. Can you please install STRique in a separate virtual environment. The problem is, that we had to freeze the pomegranate and networkx package versions in a previous release. Alternatively if you have the chance you may want to use the Docker container as it is stand-alone.

MaestSi commented 5 years ago

I was already running STRique in a separate conda virtual environment. Could you please conda list pomegranate and networkx versions?

giesselmann commented 5 years ago

they are in the requirements.txt, with

pip install -r requirements.txt --upgrade

you force an update

MaestSi commented 5 years ago

After repeating the installation today, STRique is now working. STRique output now contains as many rows as the number of reads mapped to the target region. I guess I should identify 2 clusters and average out the repeat counts in each cluster. Are there any utilities within the STRique framework to do that, or are there any future plans for that? Thanks, Simone

giesselmann commented 5 years ago

I'm happy to hear the pipeline is working. The interpretation is not part of STRique but filtering on prefix/suffix alignment scores might be a good first step to obtain reliable results. We will continue working on the accuracy of the tool. I can't say anything about extending it by further downstream analysis as these are very application specific. Pay