microbiomedata / nmdc-metadata

Managing metadata and policy around metadata in NMDC
https://microbiomedata.github.io/nmdc-schema/

N50 and L50 definitions are inverted #304

Open cmungall opened 3 years ago

cmungall commented 3 years ago

All the definitions for slots {ctg,scaf}_[N|L]\d+ need to be checked. It looks like the N50 and L50 definitions need to be swapped, as do N90 and L90, etc.

chienchi commented 3 years ago

I think that definition comes from BBtools stats. Brian believes the proper notation should be N50 = number of contigs making up 50% of the assembly, and L50 = lower length limit of the contigs making up 50% of the assembly.

chienchi commented 3 years ago

@scanon @hubin-keio Do you have any comments? We use bbtools to get the assembly stats. However, bbtools defines N50 and L50 the opposite way from what Wikipedia says.
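
To make the two conventions concrete, here is a minimal Python sketch that computes both statistics under the Wikipedia convention (N50 is a length, L50 is a count). The function name `nx_lx` and the example lengths are hypothetical and are not taken from bbtools or nmdc-schema; under the bbtools convention described above, the same two values would simply carry the opposite labels.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (length_stat, count_stat) for a list of sequence lengths.

    Hypothetical helper for illustration only. Under the Wikipedia
    convention, with fraction=0.5:
      length_stat is the N50: length of the shortest sequence in the
                   smallest set of longest sequences covering >= 50%
                   of the total assembly length;
      count_stat  is the L50: number of sequences in that set.
    Under the bbtools convention discussed in this thread, the labels
    on the two returned values are swapped.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest first
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    # Only reachable through floating-point rounding when fraction is ~1.0
    return ordered[-1], len(ordered)


scaffolds = [80, 70, 50, 40, 30, 20]  # total length 290
print(nx_lx(scaffolds, 0.5))  # (70, 2): 80 + 70 = 150 >= 145
print(nx_lx(scaffolds, 0.9))  # (30, 5): 80 + 70 + 50 + 40 + 30 = 270 >= 261
```

Whichever labels we settle on, the length-valued statistic and the count-valued statistic each need a description that matches their own slot name.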

cmungall commented 3 years ago

Just for reference, this is what we have now. The description for N50 refers to L50, and the description for L50 refers to N50, so whatever the canonical answer is, it looks like we have an inversion on our part:

  scaf_N50:
    is_a: metagenome assembly parameter
    description: >-
      Given a set of scaffolds, each with its own length, the L50 count is defined as the smallest number of scaffolds whose length sum makes up half of genome size.
    range: float

  scaf_L50:
    is_a: metagenome assembly parameter
    description: >-
      Given a set of scaffolds, the N50 is defined as the sequence length of the shortest scaffold at 50% of the total genome length.
    range: float

chienchi commented 3 years ago

commit 6c4ed99