steuernb / NLR-Annotator

NLR-Annotator upload
GNU General Public License v3.0

Clarify different output formats #32

Closed bmansfeld closed 1 year ago

bmansfeld commented 1 year ago

Hi, thanks for developing and updating the tool and for being responsive. NLR-Annotator v2 installs and works really well for me, especially compared to other options out there.

I have a few questions about the format of the different output options: gff, bed and txt. Could you elaborate (perhaps also on the readme page) on the formats? Specifically, it looks like the .txt output lists the motifs within each annotated NLR, while the .bed reports some integer values which I assumed were motifs or positions, but after closer inspection the numbers didn't really make sense to me.

Also, for the motifs.bed output, I assumed the motifs would be grouped with the NLRs annotated in the txt/gff file, but it appears that they are just general motifs detected?

Finally, I'm about to try the -a option and would like to compare multiple species annotated in this way. Is this appropriate? Would I just need to cat the output of that option from each species run?

Thanks in advance, Ben

steuernb commented 1 year ago

Hi Ben, thanks for using NLR-Annotator! Here is a useful web page describing the gff and bed formats in general: http://genome.ucsc.edu/FAQ/FAQformat . The gff format (-g) is the general feature format; you should be able to load it in genome browsers such as IGV. The same is true for bed (-b). It depends on which format you prefer. I like bed because I can add a colour.

The second bed file (-m) should display all motifs found in proximity to an NLR locus. You can use the -m output to manually check that the annotation is correct. It will also help you to see the structure of the NLR. The motifs were defined by Jupe et al. 2012.

If you want to create a phylogeny of your NLRs, most people would use the protein sequence of the NB-ARC domain. Without a gene annotation we are not able to determine the full protein sequence. The -a option provides a surrogate, which I find works quite well. It takes the motifs of the NB-ARC domain (for these we know the position and the protein sequence) and concatenates them. It also creates a multiple alignment of them based on the canonical structure of an NB-ARC domain. You can feed that fasta file into programs such as FastTree to get your phylogeny.

If you run Annotator with -a on two genomes, you can simply concatenate the output files and run FastTree on the result (make sure to mark the genome in each sequence name). Then you visualize the tree with something such as iTOL and you can see which clusters expand. I would be careful about making claims based exclusively on such an analysis, but most times it will show the tendencies. I hope that helps a bit... best wishes Burkhard
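A minimal sketch of the "mark the genome in each sequence name" step before concatenating, assuming the -a output is plain FASTA text. The function names, the `speciesA`/`speciesB` labels, the underscore separator and the demo sequences are all hypothetical and only illustrate the idea; they are not part of NLR-Annotator.

```python
def tag_fasta(text: str, label: str) -> str:
    """Prefix every FASTA header in `text` with `label` and an underscore."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            out.append(f">{label}_{line[1:].strip()}")
        else:
            out.append(line)  # sequence lines are passed through unchanged
    return "\n".join(out) + "\n"


def combine(inputs: dict) -> str:
    """Concatenate several FASTA texts, tagging each with its genome label."""
    return "".join(tag_fasta(text, label) for label, text in inputs.items())


# Hypothetical toy input: the same locus name occurs in both genomes,
# so without the label the two records would be indistinguishable.
demo = combine({
    "speciesA": ">chr1_nlr1\nGKTT--LA\n",
    "speciesB": ">chr1_nlr1\nGKTTAALA\n",
})
print(demo, end="")
```

In practice the text of each -a output file would be read from disk and the combined string written to a single file, which is then passed to the tree-building program.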

acread commented 1 year ago

I'm a big NLR-Annotator fan. Here are a few things that helped me:

I do a quick scan to make sure I don't have duplicate rows in the NLR-Annotator output. It's pretty rare that I get dupes, but it happens. I'm guessing it has to do with where the NLRs are in the sequence tiles?

One small detail with the -a output is that extra sequences (I think it's two) are added to each output. These can be helpful as an outgroup, but you'll need to decide whether or not it makes sense to include them in your alignment/tree, and be aware of them when concatenating.

How to handle missing data in the alignments? You'll see that there are gaps in some of your aligned sequences. I'm not sure what cut-off is appropriate for inclusion/exclusion; I've used 50% gaps as a cut-off in the past.

When building trees/alignments, duplicate NB-ARC domain sequences are sometimes collapsed. This means a node of the tree may represent multiple NLRs.

Not all identified NLR-domain sequences include intact NB-ARC domains. This (and the collapsing of duplicates mentioned above) is why the number of sequences in the alignment and the number of putative NLR regions are not the same.
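Two of the checks above (the gap cut-off and spotting identical NB-ARC sequences) can be sketched as follows. This assumes the alignment is FASTA text with `-` as the gap character, which is an assumption about the -a output rather than something confirmed in this thread. The 50% threshold is the cut-off mentioned above, not a recommended value, and all names and demo sequences are hypothetical.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {name: sequence}, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def gap_fraction(seq: str, gap_char: str = "-") -> float:
    """Fraction of alignment columns that are gaps in this sequence."""
    return seq.count(gap_char) / len(seq) if seq else 1.0


def filter_by_gaps(records: dict, max_gap_fraction: float = 0.5) -> dict:
    """Keep sequences whose gap fraction does not exceed the cut-off."""
    return {n: s for n, s in records.items()
            if gap_fraction(s) <= max_gap_fraction}


def group_identical(records: dict) -> list:
    """Return lists of names that share exactly the same aligned sequence."""
    groups = {}
    for name, seq in records.items():
        groups.setdefault(seq, []).append(name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical toy alignment: "b" is mostly gaps, "a" and "c" are identical.
DEMO = ">a\nGKTT----\n>b\nG-------\n>c\nGKTT----\n"
kept = filter_by_gaps(parse_fasta(DEMO))
print(sorted(kept))           # names that pass the gap cut-off
print(group_identical(kept))  # groups a tree builder might collapse into one node
```

Listing the identical groups before building the tree makes it easier to trace which NLRs a single collapsed node stands for.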

bmansfeld commented 1 year ago

Burkhard, thanks for the quick reply!! Yes, I am familiar with the original tabular formats for these files. It is true, however, that some (other) software devs output a selection of these columns in a semi- or not-so-standard fashion. I was just curious whether you were using specific output columns or not.

Specifically, I was hoping for some clarity on the last columns of the bed file, i.e. the comma-separated values such as:

MfusH1_chr11 2172506 2176661 MfusH1_chr11_nlr1 0 + 2172506 2176661 0,255,0 43 108,51,63,87,87,45,45,60,60,60,45,45,45,63,63,63,45,45,45,87,87,87,63,63,63,63,63,57,123,57,57,57,57,57,57,57,57,57,57,57,57,57,57 0,183,600,675,675,831,831,915,915,915,1002,1002,1002,1107,1107,1107,1224,1224,1224,1269,1269,1269,1392,1392,1473,1743,1812,1881,2289,2424,2598,3048,3120,3189,3258,3354,3426,3510,3582,3660,3792,4029,4098

I am not so familiar with this block notation and am interested in its relevance to the domain identification, i.e. to what is reported in the txt file.

Appreciate the -a option - I think it will be extremely helpful to get a general idea for the NLR phylogeny! Thanks for the detailed suggestions here.

Ben

steuernb commented 1 year ago

Hi Ben, that is the slightly funny way things are recorded in a BED file. The 10th column says how many blocks there are (in your case, 43). The 11th column gives the size of each of the (43) blocks, so you should see as many comma-separated values as stated in the 10th column. The 12th column then lists the starting points of the blocks. Each starting point is relative to the start of the locus, i.e. the second column.

Relevance for this program: the blocks in the last columns make the positions of the motifs visible. BED cannot include the names of the individual motifs here. So, if you want to know those, you need to load the second BED file from -m as well. Having both can be quite useful because there might be motifs overlapping with a locus but excluded from the annotation. That can be important for manual curation and for finding problems where Annotator got it wrong. cheers Burkhard
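The block arithmetic described above can be sketched as below: each block's genomic start is the locus start (column 2) plus the offset from column 12, and its end adds the size from column 11. The example line is a trimmed, illustrative version of the record quoted earlier in this thread, keeping only its first three blocks and with the end coordinate adjusted to match. The function name is hypothetical. Coordinates follow the usual BED convention (0-based start, end exclusive).

```python
def bed_blocks(line: str) -> list:
    """Turn the block columns of a 12-column BED line into (chrom, start, end) tuples."""
    fields = line.split()
    if len(fields) < 12:
        raise ValueError("expected a 12-column BED line")
    chrom = fields[0]
    locus_start = int(fields[1])
    block_count = int(fields[9])
    # Some BED writers leave a trailing comma on these lists, so strip it first.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    offsets = [int(x) for x in fields[11].rstrip(",").split(",")]
    if not (len(sizes) == len(offsets) == block_count):
        raise ValueError("block count does not match the size/offset lists")
    return [(chrom, locus_start + off, locus_start + off + size)
            for off, size in zip(offsets, sizes)]


# Trimmed illustration: first three blocks of the record quoted in this thread.
EXAMPLE = ("MfusH1_chr11 2172506 2173169 MfusH1_chr11_nlr1 0 + "
           "2172506 2173169 0,255,0 3 108,51,63 0,183,600")

blocks = bed_blocks(EXAMPLE)
for chrom, start, end in blocks:
    print(chrom, start, end)
```

The resulting intervals give motif positions only. As noted above, the motif names are not in this file, so they would have to be matched against the -m BED output.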

steuernb commented 1 year ago

Hi acread, thanks for the comments! The two additional sequences in the -a output are indeed not well documented. Sorry about that. Should I exclude them? cheers Burkhard

bmansfeld commented 1 year ago

Thanks Burkhard, very helpful. I have another issue with -a, but I'll open a separate thread for clarity. Thanks again for the software!