lpantano / seqcluster

small RNA analysis from NGS data
http://seqcluster.readthedocs.io
MIT License

problem with "Prepare samples" step #19

Closed AlexandraBomane closed 7 years ago

AlexandraBomane commented 8 years ago

Hello !

I'm trying to use seqcluster to handle multi-reads obtained from small RNA-seq. When I use "seqcluster prepare", I get the output files : log/, seqs.fastq, seqs.ma & stats_prepare.tsv. But they contain no information.

seqs.fastq is totally empty; seqs.ma contains only headers :

id seq SCR1 SCR2 SCR7

stats_prepare.tsv contains :

total 1 SCR1
added 0 SCR1
total 1 SCR2
added 0 SCR2
total 1 SCR7
added 0 SCR7

What does this mean ?
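A quick way to spot this situation is to read stats_prepare.tsv and list the samples for which nothing was added. The sketch below assumes each row holds three whitespace-separated fields (label, count, sample name), which is how the rows above appear; the function names are made up for illustration and are not part of seqcluster.

```python
from collections import defaultdict


def read_prepare_stats(lines):
    """Return {sample: {"total": int, "added": int}} from stats_prepare.tsv rows.

    Assumes three whitespace-separated fields per row: label, count, sample.
    """
    stats = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected rows
        label, count, sample = fields
        stats[sample][label] = int(count)
    return dict(stats)


def samples_without_sequences(stats):
    """Samples for which no sequence was added to the matrix."""
    return sorted(s for s, v in stats.items() if v.get("added", 0) == 0)


example = [
    "total\t1\tSCR1", "added\t0\tSCR1",
    "total\t1\tSCR2", "added\t0\tSCR2",
    "total\t1\tSCR7", "added\t0\tSCR7",
]
print(samples_without_sequences(read_prepare_stats(example)))
```

If every sample shows up in that list, the filters (or the input parsing) rejected all reads, which matches the empty seqs.fastq.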

I wanted to know how exactly I can prepare my samples before the clustering step. Indeed, I read that seqs.ma is recommended for that.

Thanks for your help ! :) Alexandra

lpantano commented 8 years ago

Hi Alexandra,

Another person had this problem, and it was solved once he used the latest version. Can you tell me the version you are using, and the output you see in the terminal when the tool runs? The latest version prints some information that can help to debug this. You can get it with bioconda:

conda install seqcluster -c bioconda

Also, can you paste here the first lines of one of those samples?

For some reason, no read is passing the filters that are used. You can relax them by passing arguments like:

--min-shared 0 -e 1

Let me know if any of this helps.

thanks

AlexandraBomane commented 8 years ago

Hi lpantano,

I use the version contained in the Docker image "lpantano/bcbio-srnaseq:v1". I guess it is not the latest version. Would it be possible to have the latest version in a Docker image ? I think it would be very convenient.

This is an example of what I get for the collapse step (I trimmed the adapters beforehand) :

docker run --rm lpantano/bcbio-srnaseq:v1 bash -c "/usr/local/bcbio/anaconda/bin/./seqcluster collapse -f sample1.fq -o outCollapsedSample1"

INFO Run collapse
INFO writing output
INFO It took 1.749 minutes ['collapse', '-f', 'sample1.fq', '-o', 'outCollapsedSample1']

First lines of the file :

@seq_5_x10
TAAGATCACTATGTCCGACT
+
AAAAAEDEEEDEDEDDEEDB
@seq_9_x2
CAGCCGACTTAGAACTGGTGCG
+
AAAAAECE:EC:EEEE:E/C:C
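To check that a collapsed file is readable before running prepare, the records can be parsed directly. This is a minimal sketch that assumes the header pattern seen above, `@seq_<n>_x<count>`, where the number after `x` is taken to be the read count of the collapsed sequence; `parse_collapsed_fastq` is a hypothetical helper, not a seqcluster function.

```python
import re

# Header pattern as it appears in the collapsed file: @seq_<n>_x<count>
HEADER = re.compile(r"^@seq_(\d+)_x(\d+)$")


def parse_collapsed_fastq(lines):
    """Yield (sequence, count) pairs from 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        match = HEADER.match(lines[i])
        if match is None:
            raise ValueError(f"unexpected header: {lines[i]!r}")
        yield lines[i + 1], int(match.group(2))


records = [
    "@seq_5_x10", "TAAGATCACTATGTCCGACT", "+", "AAAAAEDEEEDEDEDDEEDB",
    "@seq_9_x2", "CAGCCGACTTAGAACTGGTGCG", "+", "AAAAAECE:EC:EEEE:E/C:C",
]
for sequence, count in parse_collapsed_fastq(records):
    print(len(sequence), count)
```

Printing the length next to the count makes it easy to see whether the reads fall inside the size and count thresholds reported in the prepare log.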

For the prepare sample step :

INFO Run prepare
INFO Reading sequeces
INFO Creating matrix with unique sequences
INFO Filtering: min counts 10, min size 18, max size 35, min shared 2
INFO Finish preprocessing. Get a sorted BAM file of seqs.fa and run seqcluster cluster.
INFO It took 0.080 minutes ['prepare', '-c', 'designSeqcluster', '-o', 'outSeqclusterPrepare']

First (and only) line of seqs.ma :

id seq S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12

stats_prepare.tsv :

total 1 S1
added 0 S1
total 1 S2
added 0 S2
total 1 S3
added 0 S3
total 1 S4
added 0 S4
total 1 S5
added 0 S5
total 1 S6
added 0 S6
total 1 S7
added 0 S7
total 1 S8
added 0 S8
total 1 S9
added 0 S9
total 1 S10
added 0 S10
total 1 S11
added 0 S11
total 1 S12
added 0 S12

I hope it will be sufficient.

Thanks again :) Alexandra

lpantano commented 8 years ago

Hi,

I will try to update the Docker image then. In the meantime, you can update the tools inside it yourself if you want. Sorry about the error.

cheers

On Jul 6, 2016, at 4:20 AM, AlexandraBomane notifications@github.com wrote:


For the prepare sample step :

My configuration :

sample1_trimmed.fastq S1
sample2_trimmed.fastq S2
sample3_trimmed.fastq S3
sample4_trimmed.fastq S4
sample5_trimmed.fastq S5
sample6_trimmed.fastq S6
sample7_trimmed.fastq S7
sample8_trimmed.fastq S8
sample9_trimmed.fastq S9
sample10_trimmed.fastq S10
sample11_trimmed.fastq S11
sample12_trimmed.fastq S12

My command :

docker run --rm lpantano/bcbio-srnaseq:v1 bash -c "/usr/local/bcbio/anaconda/bin/./seqcluster prepare -c designSeqcluster -o outSeqclusterPrepare"
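A configuration file like the one above can be generated rather than typed by hand. The sketch below assumes the two columns (file path, sample name) are tab-separated, which the thread does not state explicitly; `write_prepare_config` is a hypothetical helper name.

```python
import csv
import io


def write_prepare_config(samples, handle):
    """Write one 'path<TAB>sample name' row per sample.

    The tab delimiter is an assumption; adjust it if your version of the
    tool expects a different separator.
    """
    names = [name for _, name in samples]
    if len(set(names)) != len(names):
        raise ValueError("sample names must be unique")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for path, name in samples:
        writer.writerow([path, name])


samples = [(f"sample{i}_trimmed.fastq", f"S{i}") for i in range(1, 13)]
buffer = io.StringIO()
write_prepare_config(samples, buffer)
print(buffer.getvalue())
```

In practice the handle would be an open file, for example `open("designSeqcluster", "w", newline="")`.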


AlexandraBomane commented 8 years ago

Hi Ipantano,

I have succeeded in building a local Docker image with seqcluster 1.2.2 and I ran the analysis up to the HTML Report step.

I just need details about the outputs :

Clustering step :

  1. counts.tsv : I don't get the meaning of the "nloci" & "ann" headers. What is counted for "nloci" ? Example of annotation cluster : '"snRNA"|"snRNA"::"RNU4-1","RNU4-1","RNU4-1"' --> what do "|" (pipe) and "::" mean ? Why is RNU4-1 written 3 times ?

Here, first lines of my annotations (GTF) :

1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";
1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

So, I guess the program catches gene_name and gene_biotype automatically... ?

How can I use this file for differential expression (DESeq2) or other downstream analysis ?

  2. size_counts.tsv : What is the meaning of each column (there's no header) ?

Report step :

Can I have some details about each figure of the maps.html files ?

  1. Coverage along precursor : a) What does each line of the graph represent ? b) How is the expression normalized ?
  2. Positions on genome : How do you read the coordinates : "69905 6 144407802 144407822 - 10" ?
  3. Annotation : I guess it means the "Annotations cluster" we retrieve in counts.tsv ?
  4. Details : How is the frequency of sequences calculated ? (compared to what ?).

Cheers, Alexandra

lpantano commented 8 years ago

Nice! I answer below each question:

On Jul 7, 2016, at 11:28 AM, AlexandraBomane notifications@github.com wrote:


Clustering step :

counts.tsv : I don't get the meaning of "nloci" & "ann" headers. What is counted for "nloci" ? Example of annotation cluster : '"snRNA"|"snRNA"::"RNU4-1","RNU4-1","RNU4-1"' --> what do "|" (pipe) and "::" mean ? Why is RNU4-1 written 3 times ?

nloci will always be 0 when the cluster has been resolved successfully. For instance, it can happen that a bunch of sequences map to hundreds of different places on the genome; seqcluster then doesn't resolve that, and puts everything under the larger region covered by those sequences. So, mainly, all rows with 0 are good rows.

The annotation is just what the cluster overlaps with. The same feature can appear several times if different locations of the cluster map to different copies, or if the annotation file used has multiple lines for that feature. I will add this to my to-do list, since it should be easy to report it only once.
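Until repeated features are collapsed by the tool itself, the duplicates in the "ann" column can be removed downstream. The sketch below only extracts the quoted labels and keeps each one once; it makes no assumption about what "|" and "::" mean, and `annotation_labels` is a hypothetical helper name.

```python
import re


def annotation_labels(ann):
    """Return the distinct quoted labels of an 'ann' cell, in first-seen order."""
    seen = []
    for label in re.findall(r'"([^"]*)"', ann):
        if label not in seen:
            seen.append(label)
    return seen


print(annotation_labels('"snRNA"|"snRNA"::"RNU4-1","RNU4-1","RNU4-1"'))
print(annotation_labels("|"))
```

A cell that contains only "|" yields an empty list, so rows without any annotation are easy to separate from annotated ones.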

So, I guess the program catch gene_name and gene_biotype automatically... ?

How can I use this file for differential expression (DESeq2) or other downstream analysis ?

size_counts.tsv : What is the meaning of each column (there's no header) ?

Yeah, it is meant to play nicely with this template: https://github.com/lpantano/seqcluster/blob/master/seqcluster/templates/report.rmd that you should have in the final folder. The numbers are the number of reads at each step for each sample, just to check whether we lost many reads along the analysis. You can use counts.tsv directly with DESeq2; it contains raw count data.

Report step : Can I have some details about each figure of the maps.html files ?
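Before handing counts.tsv to DESeq2 (an R package), it can help to confirm that the file really is a matrix of integer raw counts. This is a rough sketch that assumes a header row and three leading metadata columns (cluster id, nloci, ann) followed by one column per sample; the name of the first column and the helper `read_counts` are assumptions made for illustration.

```python
import csv
import io


def read_counts(handle, n_meta=3):
    """Return (sample_names, {cluster_id: [int counts]}) from a counts table.

    Assumes tab-separated columns, with the first n_meta columns holding
    metadata rather than counts. int() fails loudly on non-integer values.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[n_meta:]
    matrix = {}
    for row in reader:
        if not row:
            continue
        matrix[row[0]] = [int(value) for value in row[n_meta:]]
    return samples, matrix


sample_names = [f"S{i}" for i in range(1, 13)]
text = "\t".join(["id", "nloci", "ann"] + sample_names) + "\n"
text += "130\t0\t|\t882\t1165\t949\t479\t1454\t2129\t535\t624\t403\t883\t922\t944\n"
samples, matrix = read_counts(io.StringIO(text))
print(samples)
print(matrix["130"])
```

The resulting rows (cluster ids) and columns (samples) have the shape a count matrix for differential expression needs; the metadata columns are left out on purpose.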

I would use the "seqcluster.db" file. To visualize it, you can download this repo: https://github.com/lpantano/seqclusterViz and open reader.html in a browser; then you load that file and can go through the different clusters. The figures are the same, just maybe easier to play with. The lines are the number of reads at that position of the precursor. The expression is log2 RPM for each sequence. The annotation is the same as in the counts file, maybe more complete. The positions are better in this report: they are given as chr:start-end. In the HTML there is a table at the end, and those are raw counts.
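For reference, log2 RPM (reads per million) can be computed as follows. This is a generic sketch of the normalization, not the code used by the report; which total is used as the library size, and whether a pseudocount is added, are assumptions that are exposed as parameters here.

```python
import math


def log2_rpm(count, library_size, pseudocount=0.0):
    """log2 of reads per million: log2(count * 1e6 / library_size + pseudocount).

    Raises ValueError if the library size is not positive, or if the value
    inside the logarithm is zero (count 0 with no pseudocount).
    """
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    rpm = count * 1_000_000 / library_size
    return math.log2(rpm + pseudocount)


print(log2_rpm(8, 1_000_000))
print(log2_rpm(0, 1_000_000, pseudocount=1.0))
```

Because the raw counts are also available in the table at the end of the HTML page, the same values can be re-normalized with a different library size if needed.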


Hope this helps. These are very useful questions that I will use to add more documentation. Thanks a lot.

lpantano commented 8 years ago

Sorry, I described the wrong file when answering about size_counts.tsv. That file is something I am still working on. It mainly contains the position, the number of reads and the cluster.

AlexandraBomane commented 8 years ago

Hello !

Thanks for this quick answer :) I'm working on the files, integrating your information. I have another little question : in my counts.tsv, I have some lines where I find only "|" (pipe) in the "ann" column and I have counts for each sample. Example : "130 0 | 882 1165 949 479 1454 2129 535 624 403 883 922 944" Is it a bug or something like that ?

When I talk about "different lines" of the graph in the HTML report, I am asking why there are lines with different colors --> what does a color represent ? ---> I understand that a color = a sample, but I don't know which line matches which sample.

I tried seqclusterViz, but I think it doesn't work because when I download my database I can't do anything :(

Thanks, Alexandra

lpantano commented 8 years ago

Nice that it helped.

When you see only '|', it just means that the position didn't overlap any region in the GTF provided during the analysis.


lpantano commented 8 years ago

Can I ask if there is any reason not to use bcbio-nextgen, the tool that wraps the whole small RNA analysis? You can get many more results and outputs from there.

I am just curious about it, and about whether you tried it but had some issue.

Thanks for trying our tool!


AlexandraBomane commented 8 years ago

Hello !

I don't use bcbio-nextgen because I'm using another pipeline (Eoulsan : http://outils.genomique.biologie.ens.fr/eoulsan2/) to analyse small RNA-seq data. I'm particularly interested in seqcluster for analysing clusters of multi-reads.

Cheers, Alexandra

AlexandraBomane commented 8 years ago

Hi lpantano !

I wanted to know if it is possible to build annotations of clusters at clustering step with "gene_id" rather than "gene_name" ?

Thanks, Alexandra

lpantano commented 8 years ago

I can make that an option, so that it looks for one attribute instead of the other. Would that work for you?


AlexandraBomane commented 8 years ago

Hello,

This is exactly what I need.

Thanks, Alexandra

lpantano commented 8 years ago

Just to be clear, it would use gene_id if gene_name is not there. Hope this helps. Normally gene_name gives a better description of the feature in many GTF files, so I will add gene_id as a second option.
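The fallback described here can be sketched as follows. This illustrates the idea only and is not seqcluster's implementation: it parses the attribute column of a GTF line and returns the preferred attribute, falling back to a second one when the first is missing. The function names and the `preferred`/`fallback` parameters are hypothetical.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field):
    """Turn the GTF attribute column into a dict (last value wins on repeats)."""
    return dict(ATTRIBUTE.findall(field))


def feature_label(field, preferred="gene_name", fallback="gene_id"):
    """Return the preferred attribute, or the fallback if it is absent."""
    attributes = parse_attributes(field)
    return attributes.get(preferred, attributes.get(fallback))


field = ('gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; '
         'gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";')
print(feature_label(field))
print(feature_label(field, preferred="gene_id"))
print(feature_label('gene_id "ENSG00000223972"; gene_version "5";'))
```

Passing `preferred="gene_id"` gives the unambiguous identifier directly, which is the behaviour an option in the spirit of HTSeq-count's `--idattr` would provide.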


AlexandraBomane commented 8 years ago

My idea was an option similar to the "--idattr" option of HTSeq-Count. Because sometimes when I search the gene name in Uniprot, I can have some ambiguities whereas the gene_id is not ambiguous at all (to my mind).

When you say "it would use gene_id if gene_name is not there", do you mean if gene_id is not in the attribute field of the GTF ?

Thanks, Alexandra

lpantano commented 8 years ago

Sounds good, I will add this tomorrow.

thanks


lpantano commented 8 years ago

Hi,

I updated the GitHub repo. You should now be able to say which attribute to use through the --feature_id option on the command line.

Feel free to test it and give some feedback.

cheers

AlexandraBomane commented 8 years ago

Hi !

I tested the --feature_id option and it works perfectly.

Thank you very much ;) Alexandra

AlexandraBomane commented 7 years ago

Hello !

I have another problem with the "prepare samples" step. I run the command :

seqcluster prepare -c seqclusterPrepareConfiguration.txt -o multi_reads_analysis/prepareSamplesStep --minc 2 --minl 25 --maxl 35 --min-shared 2

I get this output in the terminal :

INFO Run prepare
INFO Reading sequeces
INFO S2_demultiplex.1000000: Total read 102192 ; Total added 44968
INFO S8_demultiplex.1000000: Total read 189001 ; Total added 68719
INFO S4_demultiplex.1000000: Total read 289138 ; Total added 110547
INFO S10_demultiplex.1000000: Total read 386815 ; Total added 133646
INFO S6_demultiplex.1000000: Total read 445610 ; Total added 149472
INFO S12_demultiplex.1000000: Total read 500898 ; Total added 163542
INFO Creating matrix with unique sequences
INFO Filtering: min counts 2, min size 25, max size 35, min shared 2
INFO Total skipped due to --min-shared parameter (6) : 163542
INFO Finish preprocessing. Get a sorted BAM file of seqs.fa and run seqcluster cluster.
INFO It took 0.332 minutes ['prepare', '-c', 'seqclusterPrepareConfiguration.txt', '-o', 'multi_reads_analysis/prepareSamplesStep', '--minc', '2', '--minl', '25', '--maxl', '35', '--min-shared', '2']
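The log above already hints at the cause: the number of sequences skipped by --min-shared equals the last "Total added" value, and the value in parentheses is 6 although 2 was requested. A small sketch for pulling these numbers out of a saved log is shown below; it relies only on the message wording seen in this log, and reading the number in parentheses as the threshold actually applied is an interpretation, not documented behaviour.

```python
import re

ADDED = re.compile(r"Total read (\d+) ; Total added (\d+)")
SKIPPED = re.compile(r"Total skipped due to --min-shared parameter \((\d+)\) : (\d+)")


def summarize_prepare_log(text):
    """Extract the last 'Total added' value and the --min-shared skip line."""
    added = [int(m.group(2)) for m in ADDED.finditer(text)]
    skipped = SKIPPED.search(text)
    return {
        "last_total_added": added[-1] if added else None,
        "min_shared_used": int(skipped.group(1)) if skipped else None,
        "skipped": int(skipped.group(2)) if skipped else None,
    }


log = """INFO S6_demultiplex.1000000: Total read 445610 ; Total added 149472
INFO S12_demultiplex.1000000: Total read 500898 ; Total added 163542
INFO Filtering: min counts 2, min size 25, max size 35, min shared 2
INFO Total skipped due to --min-shared parameter (6) : 163542
"""
summary = summarize_prepare_log(log)
print(summary)
print(summary["skipped"] == summary["last_total_added"])
```

When the two numbers are equal, every sequence was discarded at the --min-shared filter, which explains the empty seqs.fastq.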

But my output files are empty (just as in my first post) :

drwxr-xr-x 2 user user 4,0K sept. 16 14:58 log
-rw-r--r-- 1 user user 0 sept. 16 14:58 seqs.fastq
-rw-r--r-- 1 user user 146 sept. 16 14:59 seqs.ma
-rw-r--r-- 1 user user 434 sept. 16 14:59 stats_prepare.tsv

Have you an idea of what is happening ?

Thanks, Alexandra

lpantano commented 7 years ago

Hi,

Can you try to change some parameters to very relaxed values, like --min-shared 0, and see if the output is still empty?

If that gets no output, I will try to debug further.

thanks


AlexandraBomane commented 7 years ago

Hi,

The problem persists even if I set --min-shared to 0.

Thanks

lpantano commented 7 years ago

Can you send me the first 100000 lines of one of the samples so I can test some things?


AlexandraBomane commented 7 years ago

Here are the first 100000 lines of one of my samples : S2_sample100000_collapsed.fastq.gz (https://github.com/lpantano/seqcluster/files/477046/S2_sample100000_collapsed.fastq.gz)

Thank you for your availability :)

lpantano commented 7 years ago

It works for me if I run it with this sample and a copy of this sample. I am wondering whether you have the latest version.

If you create another file from another sample with the top 100000 reads, run the command again and still get nothing, can you send me that other file as well? That way we are sure I am running the command on the same data that is not working for you.

If that works on my end, then it should be a version problem. Maybe I fixed something in between.

thanks for helping with that.


AlexandraBomane commented 7 years ago

Hi Lorena,

I am currently using the latest version of seqcluster (1.2.3). I send you another file to check. S8_sample100000_collapsed.fastq.gz

Thanks again, Alexandra

AlexandraBomane commented 7 years ago

Hi Lorena,

I re-tested my command on other FASTQ data and it works. So I guess it was a problem with my samples.

Thank you again, Alexandra

lpantano commented 7 years ago

Hi,

I think I found it. Can you install the latest dev version from GitHub and try again?

thanks!

AlexandraBomane commented 7 years ago

Hi,

How can I access the "seqcluster" command from this GitHub repository ? I can't find the "seqcluster" executable. Should I use the seqcluster/prepare_data.py script directly ?

Thanks, Alexandra

AlexandraBomane commented 7 years ago

Hello,

I tested your modification, and I still have the same issue.

Cheers, Alexandra

lpantano commented 7 years ago

Did you install it with the seqcluster_install command, or did you clone the repository and use python setup.py install to get the dev version?

I was pretty sure that was the cause: it was setting --min-shared to the number of samples, a condition that is quite difficult to meet.

if you do this for the two files you sent me, do you get something?

seqcluster prepare -c config -o res -l 15 -u 35 -e 2 --min-shared 1

where config is:

S2_sample100000_collapsed.fastq.gz s1
S8_sample100000_collapsed.fastq.gz s2

I get results in this case, while before I was getting 0 sequences.
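The effect of the --min-shared bug described in this comment can be illustrated with a small filter sketch. It assumes that a sequence counts as present in a sample when its count there is at least the minimum count, and that it is kept when its length is within range and it is present in at least min_shared samples; the real rules in seqcluster may differ, and all names below are hypothetical.

```python
def samples_with_sequence(counts, minc):
    """Number of samples in which the sequence reaches the minimum count."""
    return sum(1 for count in counts if count >= minc)


def passes_filters(seq, counts, minc=2, minl=25, maxl=35, min_shared=2):
    """Assumed filter: length within [minl, maxl] and enough samples share it."""
    if not minl <= len(seq) <= maxl:
        return False
    return samples_with_sequence(counts, minc) >= min_shared


sequence = "ACGT" * 7            # illustrative 28-nt sequence, not real data
counts = [5, 3, 0, 0, 0, 0]      # seen in 2 of 6 samples

print(passes_filters(sequence, counts, min_shared=2))            # threshold as requested
print(passes_filters(sequence, counts, min_shared=len(counts)))  # threshold set to the number of samples
```

With the threshold forced to the number of samples, a sequence survives only if every sample contains it, so most sequences are dropped no matter what value was given on the command line.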

If you get my example working, but it does not work with all of your samples, then I would ask you to create a test example that fails for you with the latest dev version, so I can reproduce it and find the new bug.

Sorry about this, that fix should have worked.

Thanks for your patience.


AlexandraBomane commented 7 years ago

Hi Lorena,

I used the python setup.py install to get the dev version.

Using "--min-shared 1" instead of "--min-shared 2" with all my samples, I got results in my outputs :

user@e2de7afa0869:/home/multi_reads_analysis/prepareStep# ls -lh
total 23M
drwxr-xr-x 2 user user 4.0K Sep 21 08:05 log
-rw-r--r-- 1 user user 14M Sep 21 08:05 seqs.fastq
-rw-r--r-- 1 user user 9.1M Sep 21 08:05 seqs.ma
-rw-r--r-- 1 user user 434 Sep 21 08:05 stats_prepare.tsv

My exact command to get this result was :

seqcluster prepare -c seqclusterPrepareConfiguration.txt -o multi_reads_analysis/prepareStep -l 25 -u 35 -e 2 --min-shared 1

Thanks, Alexandra

lpantano commented 7 years ago

I'll close this. I hope it worked in the end!