miRTop / incubator

Where all ideas and discussions happen to lead to new repositories

GFF3::seqID #12

Open lpantano opened 7 years ago

lpantano commented 7 years ago

Hi all again!

cc: @lpantano @gurgese @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC

I will open one issue per column type. Let's see if that makes it easier to get at least a few people commenting.

The first column is for the chromosome ID. That raises the question of whether we should use genomic position or precursor position, or allow both if the header gives the exact database version for the precursor.

I am starting to be more inclined to use genomic position because that should be the same across databases, on the condition that the hairpin is included as a parent feature in the file as well. It would look like:

chr1 mirbase hairpin start end .... id=hairpin1
chr1 mirbase mirna start end ... parent=hairpin1

The only thing I am not clear on is what to do with miRNAs that have multiple precursors on the genome. I can only think of adding an attribute like other_parents=hairpin2,hairpin3... and those parents should be in the GFF3 file as well.

Please comment with new ideas, or say whether you agree or disagree, or whether there is a scenario I am missing...
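As a rough illustration of the layout proposed above, the following Python sketch writes two hairpin parent features and a mature miRNA child that lists an extra parent. The coordinates, the feature IDs, and the helper `gff3_line` are invented for the example, and `other_parents` is only the attribute name floated in this comment, not an agreed specification. The sketch uses the capitalized `ID` and `Parent` keys, which are the reserved attribute names in the GFF3 specification.

```python
def gff3_line(seqid, source, ftype, start, end, strand, attrs):
    """Format one 9-column GFF3 record; score and phase are left as '.'."""
    attr_text = ";".join(f"{key}={value}" for key, value in attrs.items())
    columns = [seqid, source, ftype, str(start), str(end), ".", strand, ".", attr_text]
    return "\t".join(columns)

# Hypothetical coordinates and IDs, for illustration only.
hairpins = [
    gff3_line("chr1", "mirbase", "hairpin", 1000, 1080, "+", {"ID": "hairpin1"}),
    gff3_line("chr5", "mirbase", "hairpin", 5000, 5085, "-", {"ID": "hairpin2"}),
]

# The mature miRNA points at one parent and lists the remaining ones separately.
mirna = gff3_line(
    "chr1", "mirbase", "mirna", 1010, 1031, "+",
    {"Parent": "hairpin1", "other_parents": "hairpin2"},
)

records = hairpins + [mirna]
print("\n".join(records))
```

Every hairpin named in `other_parents` also appears as its own record, which is the condition stated in the proposal.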

Thanks!

mhalushka commented 7 years ago

I believe the genomic position is the best option there. miRNAs that map to multiple locations can be designated with the additional numbering as you suggested.

lpantano commented 7 years ago

Thanks, I'll keep this open, but I'll move on to the next question.

ThomasDesvignes commented 7 years ago

In our own miRNA-Seq analysis tool that we are finalizing (yes, one more tool in an already quite large toolbox...), we work around that by creating "genomic_location" groups of sequences that share the same unique or multiple genomic location origins. We allow a wiggle of [user-defined] nucleotides to group isomiRs together, and sequences can be in the same genomic location group only if they share the same location set. For example, an isomiR that maps to two genomic locations won't be in the same genomic location group as an isomiR that maps to only one of the two locations. I'm not sure that helps much for this specific task, but I think that choosing one genomic location over another when a sequence is equally likely to come from either creates a bias issue... In our software, the genomic location ID refers to a series of genomic locations, for example "11:27256137-27256115;5:29390312-29390334", which shows two putative locations for an isomiR group with strand information embedded too (here the first location is on the reverse strand, while the second is on the forward strand).
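To make the location ID format concrete, here is a minimal Python sketch that parses such an ID and groups sequences by their exact location set. It is not the tool described above: the function names and the sequence labels are hypothetical, the strand rule (start greater than end means reverse strand) is inferred from the example in this comment, and the user-defined wiggle is deliberately left out.

```python
def parse_location_id(location_id):
    """Turn 'chrom:a-b;chrom:a-b' into a frozenset of (chrom, start, end, strand)."""
    locations = set()
    for part in location_id.split(";"):
        chrom, span = part.rsplit(":", 1)
        first, second = (int(value) for value in span.split("-"))
        # Assumption taken from the example: first > second encodes the reverse strand.
        strand = "-" if first > second else "+"
        locations.add((chrom, min(first, second), max(first, second), strand))
    return frozenset(locations)

def group_by_location_set(sequences):
    """Group sequence names that share exactly the same set of locations."""
    groups = {}
    for name, location_id in sequences.items():
        groups.setdefault(parse_location_id(location_id), []).append(name)
    return groups

# Hypothetical sequence labels; the coordinates are the ones quoted in the comment.
example = {
    "isomiR_a": "11:27256137-27256115;5:29390312-29390334",
    "isomiR_b": "5:29390312-29390334;11:27256137-27256115",
    "isomiR_c": "5:29390312-29390334",
}
groups = group_by_location_set(example)
for locations, names in groups.items():
    print(sorted(locations), names)
```

Here `isomiR_a` and `isomiR_b` land in one group because their location sets are identical regardless of order, while `isomiR_c` forms its own group because it maps to only one of the two locations.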

Bastami commented 7 years ago

In my opinion, as Thomas pointed out, embedding information about the strand is critical, as there are many examples of miRNAs that are located on opposite strands of the same genomic position (e.g. hsa-mir-499a & hsa-mir-499b). Regarding miRNAs that belong to multiple precursors, I think no bias occurs as long as all parents are recorded in the GFF3 file.

FlorianThibord commented 5 years ago

Dear members of the mirtop project,

I've been adding support to a miRNA-seq pipeline for output in miRGFF3 format, and I have doubts concerning this seqID value. I've seen in the examples mentioned here and in the preprint that the precursor ID should be given in this column. What about the mature ID? Could it be used instead? Or would it create compatibility issues when using mirtop? I'm currently aligning to mature sequences, and I thought it would be more coherent in that case to give the mature identifier as seqID (with start/end where the read aligned on this mature sequence). Has this been discussed before? (I've been browsing the issues but did not find a relevant topic.)

And on a side note, thank you for developing this. I've been struggling with isomiR definition myself, and this will be a very useful project for the miRNA community!

lpantano commented 5 years ago

Hi @FlorianThibord,

Thanks so much for the question. I don't think we considered this, but it is a valid point.

We can try to adapt our tool to be compatible with that. It shouldn't be a lot of work, but I would need a test file to work with. Normally we always work with the same sequences to test the tool and all the functions we code.

Just out of curiosity, do you detect isomiRs that are -2 nt at the 5p end relative to the reference sequence? In that case, do you have information about whether these 2 nt map to the precursor, or do you just not look at that?

Let me know if this plan works for you, and I will send you the sequences I need in the GFF3 format you are producing, where the seqID is the mature one.

Thanks! :)

PS: You are welcome to join if you want to be more involved, let me know!

FlorianThibord commented 5 years ago

Thanks @lpantano for your reply,

Of course, I'll gladly produce some test files in that format if it will help. I can still detect 5' or 3' additions, and determine whether they are templated by comparing them with the nucleotides surrounding the mature in the hairpin sequence(s). So I'm able to detect isomiRs with iso_5p:-2 variants.
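The templated check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the function name, the made-up hairpin sequence, and the 0-based `mature_start` offset are all hypothetical, and the sketch assumes, as this exchange implies, that iso_5p:-2 means the read starts 2 nt upstream of the reference mature.

```python
def classify_5p_extension(read, hairpin, mature_start, n_extra):
    """Compare the first n_extra bases of the read with the hairpin bases
    immediately upstream of the mature sequence.

    mature_start is the 0-based index of the reference mature in the hairpin.
    """
    if n_extra <= 0 or n_extra > mature_start or n_extra > len(read):
        raise ValueError("n_extra must be positive and fit upstream of the mature")
    upstream = hairpin[mature_start - n_extra:mature_start]
    return "templated" if read[:n_extra] == upstream else "non-templated"

# Made-up sequences, for illustration only.
hairpin = "GGCACTGAGGTAGTAGGTTGTATAGTTCC"
mature_start = 6
mature = hairpin[mature_start:mature_start + 21]

# Both reads carry 2 extra nt at the 5' end, i.e. an iso_5p:-2 variant.
templated_read = hairpin[mature_start - 2:mature_start] + mature
untemplated_read = "AA" + mature

print(classify_5p_extension(templated_read, hairpin, mature_start, 2))
print(classify_5p_extension(untemplated_read, hairpin, mature_start, 2))
```

When a mature sequence has several hairpins, the same comparison would have to be repeated against each one, since the flanking nucleotides can differ between precursors.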

And sure I'd be happy to bring my modest contribution to the project!

lpantano commented 5 years ago

Perfect. Can you send me the output you produce when you use this as input: https://github.com/miRTop/incubator/blob/master/synthetic/synthetic/synthetic_100_full.fq

It has the standard Illumina adapter: TGGAATTCTCGGGTGCCAAGGAACTC
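For readers who want to see what removing that adapter involves, here is a minimal exact-match sketch in Python. It is an assumption-laden toy, not what any pipeline in this thread uses: the function name, the `min_overlap` default, and the example insert are hypothetical, and dedicated trimmers also tolerate mismatches and use base qualities, which this sketch does not.

```python
ADAPTER = "TGGAATTCTCGGGTGCCAAGGAACTC"

def trim_3p_adapter(read, adapter=ADAPTER, min_overlap=8):
    """Remove a 3' adapter by exact matching (no mismatches allowed)."""
    # Case 1: the full adapter occurs somewhere in the read.
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # Case 2: the read ends with a prefix of the adapter (longest prefix first).
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    # No adapter found: return the read unchanged.
    return read

# Hypothetical insert sequence, for illustration only.
insert = "TGAGGTAGTAGGTTGTATAGTT"
print(trim_3p_adapter(insert + ADAPTER))
print(trim_3p_adapter(insert + ADAPTER[:10]))
```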

Can you tell me the affiliation you want to use to join the team?

Thanks

FlorianThibord commented 5 years ago

Great, I'll get working on it ASAP. Also, I'll get back to you concerning my affiliation.

FlorianThibord commented 5 years ago

Hi, you'll find the resulting GFF file here: synthetic_100_full.gff3.tar.gz. You might notice an additional attribute (Expression_OptimiR), which corresponds to the final expression computed by my pipeline. I'm not sure how I should report it there.

Concerning my affiliation: Florian Thibord, PhD student, INSERM UMR_S 1219, Bordeaux Population Health Research Center, University of Bordeaux, Bordeaux, France.

Thanks!

lpantano commented 5 years ago

Hi @FlorianThibord

Thanks for doing this. I think it is almost perfect.

I have only a couple of requests:

The version in the file is correct, but the UID is from version 1.0. We moved to a more commonly used ID, the one from Mintplate. Is there any way you could use the dev branch in mirtop to create the ID? I think you took it from master; I am sorry I forgot to mention this.

Other minor details:

After that, it would be pretty easy to integrate this into mirtop!

Thanks again!

FlorianThibord commented 5 years ago

Hi, thanks for the feedback. Yes, I did take the UID from the master branch, and did not check that the version matched. I will look into the dev branch to make the necessary changes. Otherwise, the minor details should be easy to fix! I will get back to you when it's done.

lpantano commented 5 years ago

Hey @FlorianThibord

Did you have a chance to update the UID? If not, you can remove it and I will adapt mirtop to be compatible with that, as long as you add the sequence to the line.

Thanks!

FlorianThibord commented 5 years ago

Hi @lpantano, yes, sorry for the delay. I made the changes and I think the format is compatible now. Here is the new file processed with optimiR: synthetic_100_full.gff3.tar.gz. Let me know if there is something I can do to help.

lpantano commented 5 years ago

Hi @FlorianThibord ,

I think it is almost there. I noticed a couple of typos:

Thanks a bunch! We are almost there.

FlorianThibord commented 5 years ago

Thanks @lpantano, I discarded the Changes attribute since it's not mandatory, and mirtop can fill the field if necessary. I also added "NA" to the Variant attribute when there are no variants. I think third time's the charm! Here is the file: synthetic_100_full.gff3.tar.gz
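A small Python sketch of how the attribute column could be assembled with these two choices (no Changes attribute, Variant set to "NA" when empty). The helper name and the example values are hypothetical, and the set of keys shown besides `Variant` is an assumption made for illustration, not the complete miRGFF3 attribute list.

```python
def format_attributes(read, name, parent, variants, expression):
    """Build a GFF3 attribute column; the key set is illustrative only."""
    attrs = {
        "Read": read,
        "Name": name,
        "Parent": parent,
        # "NA" marks a read with no variants, as described in the comment above.
        "Variant": ",".join(variants) if variants else "NA",
        "Expression": str(expression),
    }
    # The optional Changes attribute is intentionally not written.
    return ";".join(f"{key}={value}" for key, value in attrs.items())

# Hypothetical values, for illustration only.
reference_hit = format_attributes(
    "TGAGGTAGTAGGTTGTATAGTT", "hsa-let-7a-5p", "hsa-let-7a-1", [], 120)
isomir_hit = format_attributes(
    "CTTGAGGTAGTAGGTTGTATAGT", "hsa-let-7a-5p", "hsa-let-7a-1",
    ["iso_5p:-2", "iso_3p:-1"], 7)

print(reference_hit)
print(isomir_hit)
```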