Open johnomics opened 7 years ago
... min_len avg_len max_len sum_gap N50 L50
... 39 103 2,354 0 101 10,075
... min_len avg_len max_len sum_gap N50_len N50_num
... 39 103 2,354 0 101 10,075
... min_len avg_len max_len sum_gap N50 L50(N50_num)
... 39 103 2,354 0 101 10,075
Do these read better?
I'm afraid we can only add some explanation before making it more confusing. :)
Thanks for looking at this so quickly. I think the second version (N50_len and N50_num) works well - clear and compact. It would be better not to use L50 to refer to the N50 number at all - I think this usage should be avoided, even if it is found elsewhere.
Just my opinion though - some context and debate here and here.
Thanks John, let's just discard L50, since it causes confusion.
Great, thank you.
I feel like the conclusion of this thread was that you should use N50_num and N50_len (rather than L50), but the implementation was to remove N50_num altogether. I agree with @johnomics that L50 is confusing and N50_num is more appropriate, but I disagree with its removal entirely. I would recommend putting N50_num back into seqkit stats.
Just checked the code. L50 (N50_num) is computed but hidden. :smile:
Hi Shenwei @shenwei356 ,
Sorry to jump in here, but I think this thread might be the best place to discuss my request. I guess N50 or L50 is not confusing to people anymore since high-throughput sequencing technologies are so common today (compared to 2017). I completely agree with @RhettRautsaw, I think it is time now to bring Lx stats back to seqkit. This would be very cool for large pangenome projects using only seqkit to calculate all the stats wanted. What do you think?
By the way, I really like seqkit! Thank you very much for providing this efficient and versatile tool for the world.
Best wishes, Yutang
Just added a new column N50_num (L50).
$ echo -ne "aa\naaa\naaaa\naaaaa\naaaaaa\naaaaaaa\naaaaaaaa\naaaaaaaaa\naaaaaaaaaa\n" | csvtk mutate -Ht | seqkit tab2fx \
| seqkit fx2tab -l -n | csvtk add-header -t -n seq,len | csvtk pretty -t
seq len
---------- ---
aa 2
aaa 3
aaaa 4
aaaaa 5
aaaaaa 6
aaaaaaa 7
aaaaaaaa 8
aaaaaaaaa 9
aaaaaaaaaa 10
$ echo -ne "aa\naaa\naaaa\naaaaa\naaaaaa\naaaaaaa\naaaaaaaa\naaaaaaaaa\naaaaaaaaaa\n" | csvtk mutate -Ht | seqkit tab2fx \
| seqkit stats -a
file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 N50_num Q20(%) Q30(%) AvgQual GC(%)
- FASTA DNA 9 54 2 6 10 4 6 8 0 8 3 0 0 0 0
Wow, what a fast reply @shenwei356. Thank you very much.
I know I am asking too much, but it would be great to also support -L just like -N so that we can calculate -L 50, 90. What do you think? I really appreciate your work!
Best wishes, Yutang
Describe your issue
Thanks for building seqkit, it is an extremely useful tool that I use every day.
seqkit stats -a produces N50 and L50 statistics. These labels are very confusing; 'N50' is the 'N50 length', the read length such that 50% of the bases are in reads of this length or longer. 'L50' is the 'N50 number', the number of reads in that set (the reads at least as long as the N50 length). The term L50 has no connection with its meaning and in fact suggests it is to do with a length, which is not true. It would be much better to use the terms 'N50 length' and 'N50 number' (or similar terms) to make the meaning of these statistics clear. I realise other tools use the same jargon but it is unclear and would be better replaced.
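To make the two definitions concrete, here is a minimal Python sketch that computes both values from a list of sequence lengths. It is only an illustration of the definitions above, not seqkit's actual implementation; the helper name nx_stats and its x parameter are hypothetical, and ties or rounding may be handled differently by real tools.

```python
def nx_stats(lengths, x=50):
    """Return (Nx length, Nx number) for a list of sequence lengths.

    Hypothetical helper for illustration only, not part of seqkit.
    Nx length: the length L such that sequences of length >= L contain
               at least x% of all bases.
    Nx number: how many sequences are needed to reach that x%.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")

    threshold = sum(lengths) * x / 100
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    # Unreachable for valid input: the running total always reaches the sum.
    raise AssertionError("threshold was never reached")


# Nine sequences of length 2..10 (54 bases in total).
lengths = list(range(2, 11))
n50_len, n50_num = nx_stats(lengths)
print(n50_len, n50_num)  # 8 3
```

With lengths 2 to 10, half of the 54 bases is 27, and the three longest sequences (10 + 9 + 8) reach exactly that, so the N50 length is 8 and the N50 number is 3. The first value is a length and the second is a count, which is why the labels N50_len and N50_num are harder to mix up than N50 and L50.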