shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License

N50 and L50 jargon is confusing #15

Open johnomics opened 7 years ago

johnomics commented 7 years ago

Describe your issue

Thanks for building seqkit, it is an extremely useful tool that I use every day.

seqkit stats -a produces N50 and L50 statistics. These labels are very confusing. 'N50' is the 'N50 length': the read length such that 50% of the bases are in reads of this length or longer. 'L50' is the 'N50 number': the number of reads in that set. The term L50 has no connection with its meaning and in fact suggests a length, which it is not. It would be much better to use the terms 'N50 length' and 'N50 number' (or similar) to make the meaning of these statistics clear. I realise other tools use the same jargon, but it is unclear and would be better replaced.
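
To make the two definitions above concrete, here is a minimal Python sketch. It is not seqkit's actual code (seqkit is written in Go); the function name n50_stats and the "at least 50% of the bases" threshold convention are assumptions made for illustration, and tools may differ in how they treat a cumulative sum that lands exactly on half the total.

```python
def n50_stats(lengths):
    """Return (n50_len, n50_num) for a list of sequence lengths.

    n50_len: length L such that sequences of length >= L hold at least
             half of all bases (often labelled N50).
    n50_num: how many of the longest sequences are needed to reach that
             point (often labelled L50).
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    cumulative = 0
    # Walk from the longest sequence down, accumulating bases.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        cumulative += length
        # Assumed convention: stop once at least half of the bases are covered.
        if 2 * cumulative >= total:
            return length, count


# Toy data: 300 bases in total, so the halfway mark is 150 bases.
print(n50_stats([100, 80, 60, 40, 20]))  # (80, 2): 100 + 80 = 180 >= 150
```

The first value is a length and the second is a count, which is the distinction the proposed names 'N50 length' and 'N50 number' are meant to spell out.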

shenwei356 commented 7 years ago

... min_len  avg_len  max_len  sum_gap  N50     L50
...      39      103    2,354        0  101  10,075

... min_len  avg_len  max_len  sum_gap  N50_len  N50_num
...      39      103    2,354        0      101   10,075

... min_len  avg_len  max_len  sum_gap  N50  L50(N50_num)
...      39      103    2,354        0  101        10,075

Do these read better?

I'm afraid we can only add some explanation before making it more confusing. :)

johnomics commented 7 years ago

Thanks for looking at this so quickly. I think the second version (N50_len and N50_num) works well - clear and compact. It would be better not to use L50 to refer to the N50 number at all - I think this usage should be avoided, even if it is found elsewhere.

Just my opinion though - some context and debate here and here.

shenwei356 commented 7 years ago

Thanks John, let's just discard the L50 which brings confusion.

johnomics commented 7 years ago

Great, thank you.

RhettRautsaw commented 8 months ago

I feel like the conclusion of this thread was that you should use N50_num and N50_len (rather than L50), but the implementation just removed N50_num altogether. I agree with @johnomics that L50 is confusing and N50_num is more appropriate, but I disagree with its removal entirely. I would recommend putting N50_num back into seqkit stats.

shenwei356 commented 8 months ago

Just checked the code. L50 (N50_num) is computed but hidden. :smile:

Yutang-ETH commented 7 months ago

Hi Shenwei @shenwei356 ,

Sorry to jump in here, but I think this thread might be the best place to discuss my request. I guess N50 or L50 is not confusing to people anymore since high-throughput sequencing technologies are so common today (compared to 2017). I completely agree with @RhettRautsaw, I think it is time now to bring Lx stats back to seqkit. This would be very cool for large pangenome projects using only seqkit to calculate all the stats wanted. What do you think?

By the way, I really like seqkit! Thank you very much for providing this efficient and versatile tool for the world.

Best wishes, Yutang

shenwei356 commented 7 months ago

Just added a new column N50_num (L50).

$ echo -ne "aa\naaa\naaaa\naaaaa\naaaaaa\naaaaaaa\naaaaaaaa\naaaaaaaaa\naaaaaaaaaa\n" | csvtk mutate -Ht | seqkit tab2fx \
    | seqkit fx2tab -l -n | csvtk add-header -t -n seq,len | csvtk pretty -t
seq          len
----------   ---
aa           2  
aaa          3  
aaaa         4  
aaaaa        5  
aaaaaa       6  
aaaaaaa      7  
aaaaaaaa     8  
aaaaaaaaa    9  
aaaaaaaaaa   10

$ echo -ne "aa\naaa\naaaa\naaaaa\naaaaaa\naaaaaaa\naaaaaaaa\naaaaaaaaa\naaaaaaaaaa\n" | csvtk mutate -Ht | seqkit tab2fx \
    | seqkit stats -a
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len  Q1  Q2  Q3  sum_gap  N50  N50_num  Q20(%)  Q30(%)  AvgQual  GC(%)
-     FASTA   DNA          9       54        2        6       10   4   6   8        0    8        3       0       0        0      0

Yutang-ETH commented 7 months ago

Wow, what a fast reply @shenwei356. Thank you very much.

I know I am asking too much, but it would be great to also support -L just like -N so that we can calculate -L 50, 90. What do you think? I really appreciate your work!

Best wishes, Yutang
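
A request like this amounts to generalising the 50% threshold to an arbitrary percentage x. As a rough sketch only, the following Python computes the Nx length and the matching Lx count for the toy lengths 2 to 10 used in the example earlier in the thread. The function nx_stats is hypothetical and not part of seqkit, and the "at least x% of the bases" convention is an assumption.

```python
def nx_stats(lengths, x):
    """Return (nx_len, nx_num) for a percentage x with 0 < x <= 100.

    nx_len: length L such that sequences of length >= L hold at least
            x percent of all bases (Nx).
    nx_num: how many of the longest sequences are needed for that (Lx).
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    total = sum(lengths)
    cumulative = 0
    # Walk from the longest sequence down, accumulating bases.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        cumulative += length
        # Equivalent to cumulative / total >= x / 100, without division.
        if cumulative * 100 >= total * x:
            return length, count


lengths = list(range(2, 11))  # 2, 3, ..., 10 (54 bases in total)
for x in (50, 90):
    nx_len, nx_num = nx_stats(lengths, x)
    print(f"N{x}={nx_len}  L{x}={nx_num}")
```

For x = 50 this gives a length of 8 and a count of 3, matching the N50 and N50_num columns in the seqkit stats -a output above; the x = 90 values come from this sketch alone, not from seqkit.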