juliema / aTRAM

BSD 3-Clause "New" or "Revised" License
35 stars 14 forks source link

'stitcher' required files are not documented #289

Closed sestaton closed 4 years ago

sestaton commented 4 years ago

The help menu for the 'stitcher' program describes a taxon file that must be used, but it's not clear what format this should take or how it is used. There is a tutorial file for this program but it is empty, and I don't see any test files indicating the format.

My question is, what is the format of the taxon file and how is it used? There is an error message from 'stitcher' that indicates there should be a taxon name and contig name included in the assembly, but this is ambiguous without knowing what this means in terms of the format (I made a few guesses without success but decided to stop and ask).

I'm guessing there is a formatting step after the 'assembly' phase, or maybe the sequences going into the assembly should be in a certain format? The previous steps completed without error, suggesting the formats are as expected, so I'm looking for a little advice on the formatting.

juliema commented 4 years ago

Hey good question I will add some text to clarify this.

The taxon file is just a text file with the taxon names listed one one each line. The taxon names should correspond to the names of the atram libraries so that they correspond with the output files from atram.

hope that helps

On Wed, Apr 29, 2020 at 4:18 PM Evan Staton notifications@github.com wrote:

The help menu for the 'stitcher' program describes a taxon file that must be used, but it's not clear what format this should take or how it is used. There is a tutorial file for this program but it is empty, and I don't see any test files indicating the format.

My question is, what is the format of the taxon file and how is it used? There is an error message from 'stitcher' that indicates there should be a taxon name and contig name included in the assembly, but this is ambiguous without knowing what this means in terms of the format (I made a few guesses without success but decided to stop and ask).

I'm guessing there is a formatting step after the 'assembly' phase, or maybe the sequences going into the assembly should be in a certain format? The previous steps completed without error, suggesting the formats are as expected, so I'm looking for a little advice on the formatting.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/juliema/aTRAM/issues/289, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AZ3R5IAPMLMP6VPWALBLRPCDQFANCNFSM4MUBEKSQ .

sestaton commented 4 years ago

Okay, thanks for the response.

Just to be clear, should the taxon file include all species in the query and the reference species, or just the query?

And just for example, if I have one query (Lactuca sativa) and one reference (Helianthus annuus) species how would I format the taxon file and the assembly/reference specifically?

juliema commented 4 years ago

The taxon file does not include the reference sequences or taxa. It is specifically referring to the taxa that you assembled loci for.

The reference file should have the locus names in the name line that matches the locus name in the output file from atram - e.g. the locus names in your original target file.

The atram assembly outfile has the taxon name and the locus name in the name of the file. The stitcher will be looking at each taxon in your taxon list and matching those with the names of the loci from your reference file to build a single file for each locus that has all the stitched exons for each taxon.

let us know if you have any other questions.

On Wed, Apr 29, 2020 at 5:54 PM Evan Staton notifications@github.com wrote:

Okay, thanks for the response.

Just to be clear, should the taxon file include all species in the query and the reference species, or just the query?

And just for example, if I have one query (Lactuca sativa) and one reference (Helianthus annuus) species how would I format the taxon file and the assembly/reference specifically?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/juliema/aTRAM/issues/289#issuecomment-621485321, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AZ3UUORDBWSKCURTBKWLRPCOYRANCNFSM4MUBEKSQ .

sestaton commented 4 years ago

The taxon file does not include the reference sequences or taxa. It is specifically referring to the taxa that you assembled loci for.

Okay, this is what I assumed would be the case. Using the example I provided, I would have:

Lactuca sativa

in the taxon file.

The reference file should have the locus names in the name line that matches the locus name in the output file from atram - e.g. the locus names in your original target file.

Following the README, I used the same reference/target file for all steps and the gene/locus name is in the header so I assumed this is what is intended.

The atram assembly outfile has the taxon name and the locus name in the name of the file.

By locus are you referring to the reference species, or do you have to split the input assembly into individual FASTA files with the locus in the file name somehow? That step is easily achieved but getting there is unclear, specifically the "locus" name in the file part. What would that look like?

The stitcher will be looking at each taxon in your taxon list and matching those with the names of the loci from your reference file to build a single file for each locus that has all the stitched exons for each taxon.

This part is unclear. It is not clear what matching a taxon to a locus means or how it works. A locus being a gene name would take the form HaChr01g000001 whereas the query taxon would Lactuca sativa, using examples I provided above. There must be some formatting step I suppose or some other process (perhaps alternative term use) that has to happen.

juliema commented 4 years ago

Hi Evan,

Did you assemble a single locus? The reference file usually has many target loci in them in a fasta format with each locus being named it the nameline. Here are examples of those file.

Example taxon file TaxonA TaxonB

Example Reference file - the locus names are in the name line ( make sure it is AA not DNA)

CytB NFPPRP EF1a FRYI

so my atram output files usually looks like these:

taxonA.CytB.filtered_contigs.fasta taxonA.Ef1a.filtered_contigs.fasta taxonB.CytB.filtered_contigs.fasta taxonB.Ef1a.filtered_conitigs.fasta

the stitcher will find all the outfiles for each locus for each taxon (e.g. match the taxa names with the locus names) and produce a single file per locus with the stitched exons for each taxa in them. The names of those files will be

CytB.stitched_exons.fasta Ef1a.stitched_exons.fasta

Hope that helps clear things up. If you used the same reference file then that should be fine. It sounds like your reference file might be named by the reference taxon rather than the locus? That should be fine too. If you were trying to assemble many loci then those need to be separated in the reference file so aTRAM can assemble them one after the other. Use the -Q (not -q) in the atram assembly run.

On Thu, Apr 30, 2020 at 3:29 AM Evan Staton notifications@github.com wrote:

The taxon file does not include the reference sequences or taxa. It is specifically referring to the taxa that you assembled loci for.

Okay, this is what I assumed would be the case. Using the example I provided, I would have:

Lactuca sativa

in the taxon file.

The reference file should have the locus names in the name line that matches the locus name in the output file from atram - e.g. the locus names in your original target file.

Following the README, I used the same reference/target file for all steps and the gene/locus name is in the header so I assumed this is what is intended.

The atram assembly outfile has the taxon name and the locus name in the name of the file.

By locus are you referring to the reference species, or do you have to split the input assembly into individual FASTA files with the locus in the file name somehow? That step is easily achieved but getting there is unclear, specifically the "locus" name in the file part. What would that look like?

The stitcher will be looking at each taxon in your taxon list and matching those with the names of the loci from your reference file to build a single file for each locus that has all the stitched exons for each taxon.

I have no idea what this means. Matching a taxon to a locus? How? A locus being a gene name would take the form HaChr01g000001 whereas the query taxon would Lactuca sativa, using examples I provided above. There must be some formatting step I suppose or some other process (perhaps alternative term use) that has to happen.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/juliema/aTRAM/issues/289#issuecomment-621663862, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AZ3TXDSZFI7I6U45FYN3RPESGJANCNFSM4MUBEKSQ .

sestaton commented 4 years ago

Did you assemble a single locus? The reference file usually has many target loci in them in a fasta format with each locus being named it the nameline. Here are examples of those file.

The input reference file is a set of protein sequences from a single species.

Example taxon file TaxonA TaxonB

I think the format I'm using here is okay.

Example Reference file - the locus names are in the name line ( make sure it is AA not DNA) ...

I think this is fine also.

so my atram output files usually looks like these: taxonA.CytB.filtered_contigs.fasta taxonA.Ef1a.filtered_contigs.fasta taxonB.CytB.filtered_contigs.fasta taxonB.Ef1a.filtered_conitigs.fasta

This is where my results deviated. I get a single output file from the assembly step called "*all_contigs.fasta" as opposed to files with the locus name.

It sounds like your reference file might be named by the reference taxon rather than the locus? That should be fine too. If you were trying to assemble many loci then those need to be separated in the reference file so aTRAM can assemble them one after the other. Use the -Q (not -q) in the atram assembly run.

My reference is named for the species (e.g.,Helianthus_annuus.faa, using the examples I listed above). I am trying to assemble many loci and I think this should be formatted correct based on the comments above. Okay, I will look into the -Q option. The assembly ran without error, but I think the way I ran it did no produce the output expected for the stitching step, so I will try that and update the issue here.

Thanks for the advice.

sestaton commented 4 years ago

I have the assembly step running now with the -Q option, but I estimate it will take more than two months to finish (using one query and one target species). I'm curious if there is any practical way to speed up this process. I have specified 96 processors to be used for this step but it appears that the assemblies are done in a serial fashion, so I'm not taking full advantage of the CPUs and RAM I have available.

Also, I'm using spades for the assembly with the --careful option. It occurred to me this might slow things down, but based on how the data is being processed I don't see much time being spent on that step for each assembly.

Any advice would be helpful, thanks.

juliema commented 4 years ago

Hi Evan,

How many shards does your library have? That is will limit the number of processors you want to use. aTRAM will blast as many shards as possible (e.g. how many processors you give it) at once then it all cones together to a few processors for the assembly step. I have found spades is a bit slower than other assemblers and adding the careful option might slow it down.

You could try a few test runs with abyss or trinity to see how quickly it goes.

Also you can set up more than one run. For example, if you have 100 loci to assemble, set up one run to assemble 50 and other to assemble the other 50 and have them running simultaneously.

On Mon, May 4, 2020 at 12:39 PM Evan Staton notifications@github.com wrote:

I have the assembly step running now with the -Q option, but I estimate it will take more than two months to finish (using one query and one target species). I'm curious if there is any practical way to speed up this process. I have specified 96 processors to be used for this step but it appears that the assemblies are done in a serial fashion, so I'm not taking full advantage of the CPUs and RAM I have available.

Also, I'm using spades for the assembly with the --careful option. It occurred to me this might slow things down, but based on how the data is being processed I don't see much time being spent on that step for each assembly.

Any advice would be helpful, thanks.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/juliema/aTRAM/issues/289#issuecomment-623573429, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AZ3UDE6PL6SKX3JICB5DRP3VU5ANCNFSM4MUBEKSQ .

sestaton commented 4 years ago

How many shards does your library have? That is will limit the number of processors you want to use.

Based on the preprocessor log it says there are 27 BLAST DBs. Is this what you are referring to as shards?

Also you can set up more than one run.

Do you mean split the query into chunks and run them independently? That makes sense to me. Analyzing loci on a per-chromosome basis could make a huge difference.

juliema commented 4 years ago

Hey Evan,

yes ok so if you have 27 shards then you probably dont want to assign more than 30 cores to any run. If you have 90 cores then you could set up three runs and split the total loci to assemble into three files and run each with 30 cores.

On Mon, May 4, 2020 at 12:53 PM Evan Staton notifications@github.com wrote:

How many shards does your library have? That is will limit the number of processors you want to use.

Based on the preprocessor log it says there are 27 BLAST DBs. Is this what you are referring to as shards?

Also you can set up more than one run.

Do you mean split the query into chunks and run them independently? That makes sense to me. Analyzing loci on a per-chromosome basis could make a huge difference.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/juliema/aTRAM/issues/289#issuecomment-623580297, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AZ3W5VEQGDPYEKGAO7TLRP3XIDANCNFSM4MUBEKSQ .

sestaton commented 4 years ago

yes ok so if you have 27 shards then you probably dont want to assign more than 30 cores to any run. If you have 90 cores then you could set up three runs and split the total loci to assemble into three files and run each with 30 cores.

Okay, I think I will explore parallelizing this run. Thanks.

sestaton commented 4 years ago

One more question: how does the --fraction option influence the run time and how should this option be used ideally? I was not sure if this should be left at the default or changed to capture more loci.

sestaton commented 4 years ago

Never mind the previous question. The fraction part seems pretty clear looking at the code.

I have the assembly step running in a parallel fashion now, so I should be able to get back to the original question about the 'stitching' phase once these processes are completed.

juliema commented 4 years ago

sounds good. the --fraction allows you to search only a part of the library. This is useful if you are trying assemble loci with really high coverage (e.g. mitochondrial and chloroplast genes) where you dont need to take the computational time to search the entire library, instead you will likely get a decent assembly with only a fraction of the library. This helps to save computational time.

On Tue, May 5, 2020 at 2:32 AM Evan Staton notifications@github.com wrote:

Never mind the previous question. The fraction part seems pretty clear looking at the code.

I have the assembly step running in a parallel fashion now, so I should be able to get back to the original question about the 'stitching' phase once these processes are completed.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/juliema/aTRAM/issues/289#issuecomment-623879244, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AZ3TLAFNLUKRAAVON5VLRP6XJTANCNFSM4MUBEKSQ .

sestaton commented 4 years ago

I have arrived back at the original question at least, but it is still not clear what is expected.

Here is the error message I'm getting: 2020-05-08 04:53:28 ERROR: There were no hits. Please make sure your contig file names contain both a reference gene name and taxon name in them.

The input taxon file is like so (one species):

taxonA

The input reference looks like:

>gene1
MSVNPSIGGEIPPPIPP

The input contigs are named like so:

taxonA.gene1.filtered_contigs.fasta

and these are in a directory called filtered_contigs. My invocation of the command was:

time ./aTRAM/atram_stitcher.py \
 -T taxon_file.txt \
 -t /mnt/nvme2n1/tmp \
 -o stitcher_out \
 --assemblies-dir=filtered_contigs \
 -f '*filtered_contigs.fasta' \
 --reference-genes=ref.faa  \
 --log-level debug \
 -l stitcher_debug_may7.log

The reference is the same that was used in previous steps. The error I'm getting, as noted above, says there must be a taxon and gene name in the file but this obviously is the case.

Any help on what the program is expecting would be helpful. Thanks.

juliema commented 4 years ago

hey Evan,

Give it a path to the contig files with the -a command. That should get you there.

On Fri, May 8, 2020 at 1:19 AM Evan Staton notifications@github.com wrote:

I have arrived back at the original question at least, but it is still not clear what is expected.

Here is the error message I'm getting: 2020-05-08 04:53:28 ERROR: There were no hits. Please make sure your contig file names contain both a reference gene name and taxon name in them.

The input taxon file is like so (one species):

taxonA

The input reference looks like:

gene1 MSVNPSIGGEIPPPIPP

The input contigs are named like so:

taxonA.gene1.filtered_contigs.fasta

and these are in a directory called filtered_contigs. My invocation of the command was:

time ./aTRAM/atram_stitcher.py \ -T taxon_file.txt \ -t /mnt/nvme2n1/tmp \ -o stitcher_out \ --assemblies-dir=filtered_contigs \ -f '*filtered_contigs.fasta' \ --reference-genes=ref.faa \ --log-level debug \ -l stitcher_debug_may7.log

The reference is the same that was used in previous steps. The error I'm getting, as noted above, says there must be a taxon and gene name in the file but this obviously is the case.

Any help on what the program is looking is expecting would helpful. Thanks.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/juliema/aTRAM/issues/289#issuecomment-625635793, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AZ3W2FMW35YQGGEBY63DRQOI5LANCNFSM4MUBEKSQ .

sestaton commented 4 years ago

Thanks for the response.

Give it a path to the contig files with the -a command. That should get you there.

The usage suggests that the PATH to the assemblies can be given by -a, --assemblies-dir.

I specified that in the command above and tried the short arg version for completeness, but I get the same result.

juliema commented 4 years ago

are your 'filtered_contigs.fasta' files empty?

On Fri, May 8, 2020 at 12:19 PM Evan Staton notifications@github.com wrote:

Thanks for the response.

Give it a path to the contig files with the -a command. That should get you there.

The usage suggests that the PATH to the assemblies can be given by -a, --assemblies-dir.

I specified that in the command above and tried the short arg version for completeness, but I get the same result.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/juliema/aTRAM/issues/289#issuecomment-625894213, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AZ3SFFPX3T2O66DULNYTRQQWIRANCNFSM4MUBEKSQ .

juliema commented 4 years ago

also you might give it the full path to the assemblies directory and try that

On Fri, May 8, 2020 at 1:21 PM Julie Allen jallen23@unr.edu wrote:

are your 'filtered_contigs.fasta' files empty?

On Fri, May 8, 2020 at 12:19 PM Evan Staton notifications@github.com wrote:

Thanks for the response.

Give it a path to the contig files with the -a command. That should get you there.

The usage suggests that the PATH to the assemblies can be given by -a, --assemblies-dir.

I specified that in the command above and tried the short arg version for completeness, but I get the same result.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/juliema/aTRAM/issues/289#issuecomment-625894213, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AZ3SFFPX3T2O66DULNYTRQQWIRANCNFSM4MUBEKSQ .

sestaton commented 4 years ago

No the files are not empty. The example I posted was for simplicity by I actually pass an ENV var to the -a arg that is the full system path.

rafelafrance commented 4 years ago

Hey Evan, I'm sorry you're having so much trouble here. As you have found out, the stitcher still has some rough edges. Let's see if I can help... well I'm going to try to help you help yourself.

I hope you're OK with digging into databases.

I noticed that you are using the --temp-dir argument (-t), good for debugging. It you also use the --keep-temp-dir argument you'll save all of the temp files for further analysis. The one that I'm most interested in is the atram_stitcher.sqlite.db. That's where we save the temporary database that the stitcher uses.

To look at this table I use the "DB Browser for SQLite". It available for free. If you cannot get your hands on that then using the sqlite3 command line tool is a bit more cumbersome but it'll work.

Your error telling us that there are no records in the "exonerate" table. Let see if we can find out why. The first things to look at are the "contigs" table and the "reference_genes" table. Does every contig in the contigs table have a ref_name that matches a ref_name in the reference_genes table. So the steps are:

  1. Add the --keep-temp-dir option to the stitcher run. Then find the created temporary directory.
  2. Open the temp database (atram_stitcher.sqlite.db) with either the DB browser (or sqlite3 command line).
  3. Look at the "contigs" table and the "reference_genes" table.
    1. If you don't have the browser then command line select * from contigs; and select * from reference_genes; will show you the contents of the tables.
  4. We're most interested in the count of contigs in the contig table. If you have the DB browser then you can just click on the "Browse Data" tab and select the contigs table. It'll show you the count of records below. If you're using the sqlite3 command line then select count(*) from contigs;
  5. Make sure that every contig's ref_name has a corresponding entry in the reference_genes table.
    1. if you're OK with SQL then you can select count(*) from contigs where ref_name in (select ref_name from reference_genes); This should match the count above.

That's enough to get started.

sestaton commented 4 years ago

Thanks for the responses. Though, I don't think it makes sense for me to pursue this idea at this time.

Looking into databases is something I am definitely comfortable with but it needs to have a clear purpose. Spending time inspecting tables to "get started" seems like I'd be going down a rabbit hole, and not I don't see a way forward with that approach.

The solution seems likely to involve diving into the code and inserting breakpoints to inspect the issue and I think that task would be better left to the authors.

rafelafrance commented 4 years ago

I respect this decision, of course.

FYI: I was trying to bisect where the problem could be coming from. Is it a problem with the input data or with the exonerate assembly. But you are right in that depending on where the issue occurred there would be more debugging and without the offending data I can't do much here.

sestaton commented 4 years ago

I'd be happy to provide the data and scripts I'm using if you think it would help. I am using data from one species obtained from the SRA and mapping against sunflower.

The ultimate goal was to add this species to a phylogenetic tree to gain insights into gene family evolution. The relationships are already known so the placement of the genes could be informative.

And to be clear, I do want to help. This is a cool project but I'm burdened with a lot of other tasks at the moment.

rafelafrance commented 4 years ago

Your data would help for for sure. I cannot guarantee a fix but it would definitely make it much more likely. When you get the chance, please send your input files along (or links to them) as well as the related log file.

PS: I sorry if I came off as snarky and/or rude. Not my intent, even in the slightest. You've already been quite helpful in identifying the other bug and this is the kind of thing we want to foster here.

PPS: I understand being overburdened and I don't want to add to that. And just so you're aware, I'm on Dr. Allen's projects only on Fridays so if there is a lag it's because of that.

sestaton commented 4 years ago

I can send the links for sure. No reason to worry about the responses, I never perceived any tone other than professional and helpful, and they were very timely. I hope my comments were received in the same manner.

Also, when if comes to code I would always err on the side of being direct and descriptive rather than trying to be cordial. People always want a quick solution and will never fault you for trying to help in that respect.

juliema commented 4 years ago

Hey Evan also a quick FYI we edited a few things in the way atram finds files in the stitcher code. you might want to try it again and see if your problem is resolved.

thanks!

On Fri, May 8, 2020 at 4:46 PM Evan Staton notifications@github.com wrote:

I can send the links for sure. No reason to worry about the responses, I never perceived any tone other than professional and helpful, and they were very timely. I hope my comments were received in the same manner.

Also, when if comes to code I would always err on the side of being direct and descriptive rather than trying to be cordial. People always want a quick solution and will never fault you for trying to help in that respect.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/juliema/aTRAM/issues/289#issuecomment-626013221, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AZ3WSO7WRIXYJWO3O6Q3RQRVUDANCNFSM4MUBEKSQ .

sestaton commented 4 years ago

Thanks for the update. This could indeed be related, but when I have the chance to try this approach again I will share my findings related to the original issue.

sestaton commented 4 years ago

I see never shared a link to the files, so I'm sorry about that. That makes debugging very hard, but I will try to provide some useful information when I get the chance.