samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Adding Ultima Genomics to the list of allowed SAM platforms #648

Closed ilyasoifer closed 2 years ago

ilyasoifer commented 2 years ago

Hello, we would like to add the Ultima Genomics platform to the list of allowed platforms for the PL tag in SAM. Reference here. I would appreciate advice on how best to approach this.

Thank you very much in advance

Ilya Soifer (ilyasoifer) Director, Bioinformatics Ultima Genomics

jmarshall commented 2 years ago

Raising an issue here is the right way to get the process started. The last time we added to this list can be seen in #178 and PR #454, which have more background on what I'll summarise below — but there's also quite a lot of discussion there that goes off into the weeds so I don't recommend reading all of it!

The current PL list is

CAPILLARY, DNBSEQ (MGI/BGI), HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), and SOLID.

As these have accreted piecemeal over the years, there is a mixture of names of manufacturers and of the underlying technologies. As discussed on the PR mentioned, we've moved to preferring to have the names of the technologies or platforms in the PL list. The idea is that the reason for tools to consult this field is to make inferences about the error model etc, which is a property of the platform rather than of the manufacturer. (In some cases, the two are the same; in other cases, the manufacturer may have changed names several times already or there may be a different name for the platform in common use.)

As Ultima Genomics has only decloaked today :smile: you are best placed to suggest an appropriate keyword to add to the list. Probably ULTIMA or UG? Or something else that better represents your mostly natural sequencing-by-synthesis technology?
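As a rough illustration of how a tool might consult this field, here is a minimal Python sketch that reads the PL value from an `@RG` header line and checks it against the list quoted above. The function names `parse_rg_line` and `check_platform` are hypothetical, the keyword set is copied from this comment rather than from the current spec, and the exact-match comparison is an assumption.

```python
# PL keywords exactly as quoted in the comment above; the live spec may differ.
ALLOWED_PL = {
    "CAPILLARY", "DNBSEQ", "HELICOS", "ILLUMINA", "IONTORRENT",
    "LS454", "ONT", "PACBIO", "SOLID",
}


def parse_rg_line(line):
    """Split one @RG header line into a {tag: value} dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    # Each remaining field is TAG:VALUE; split on the first colon only,
    # because values may themselves contain colons.
    return dict(field.split(":", 1) for field in fields[1:])


def check_platform(line, allowed=ALLOWED_PL):
    """Return (platform, is_known); platform is None when PL is absent."""
    platform = parse_rg_line(line).get("PL")
    if platform is None:
        return None, False
    return platform, platform in allowed


if __name__ == "__main__":
    print(check_platform("@RG\tID:run1\tSM:sample1\tPL:ILLUMINA"))
```

A keyword that has not yet been added to the list, such as either candidate suggested here, would come back with `is_known` set to `False` until the allowed set is updated.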

jkbonfield commented 2 years ago

With @yfarjoun in the author list on the paper I would expect nothing less than following the correct procedures for specification updates, but a big thank you too for getting this in early. I'd be happy with ULTIMA too. If the company moves on to another unrelated technology, then it can present a new more-specific name.

As an aside, you may also wish to fill out PM (platform model) too as it can be useful for data consumers to distinguish between the different updates as the instrument progresses. Unlike PL this is not a controlled vocabulary, so you can decide what granularity of information you deem appropriate there and it doesn't need a specification update.
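To make the PL versus PM distinction concrete, the following Python sketch assembles an `@RG` header line carrying both tags. Everything specific in it is an assumption for illustration: `format_rg_line` is a hypothetical helper, `ULTIMA` is only one of the candidate keywords under discussion in this thread, and the PM string is an arbitrary free-text example.

```python
def format_rg_line(read_group_id, sample, platform=None, platform_model=None):
    """Build a tab-separated @RG header line (hypothetical helper).

    platform       -> PL, expected to come from the controlled vocabulary
    platform_model -> PM, free text chosen by the data producer
    """
    values = {
        "ID": read_group_id,
        "SM": sample,
        "PL": platform,
        "PM": platform_model,
    }
    fields = ["@RG"]
    for tag, value in values.items():
        if value is None:
            continue  # optional tags are simply left out
        if "\t" in value or "\n" in value:
            raise ValueError(f"{tag} value must not contain tabs or newlines")
        fields.append(f"{tag}:{value}")
    return "\t".join(fields)


if __name__ == "__main__":
    # Candidate keyword and model string are illustrative only.
    print(format_rg_line("rg1", "sample1", platform="ULTIMA", platform_model="UG 100"))
```

Because PM is not validated against any list, the producer can change its granularity between instrument updates without touching the specification.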

ilyasoifer commented 2 years ago

@jkbonfield, @jmarshall - thanks for the help. I think we will go with the UG keyword. What happens next? Should I just update the spec TeX document? And thanks @yfarjoun for showing us the light.

jkbonfield commented 2 years ago

You can make your own branch, edit & commit the doc, and then make a PR from that for review here.

Thanks.

jmarshall commented 2 years ago

To be honest, the actual change is trivial and we're happy to make it ourselves (add “UG (Ultima Genomics)” alphabetically; add an entry to Appendix B). And hence ideally avoid splitting the discussion between this issue and an eventual PR.

What would be more useful IMHO would be a little more background:

The process from here is that we will discuss it at our next meeting (June 21st) and most likely approve the addition (if consensus has not already been achieved prior to that, which it may well have). You or others from Ultima Genomics would be welcome to attend that Zoom meeting if you wish.

[^1]: Personally I have a mild preference for more spelt-out keywords — e.g. amongst the existing keywords, ILLUMINA's styling rather than ONT's — as they are more mnemonic for those who are not specialists in the particular platform. But this is a comparatively minor consideration.

[^2]: e.g. from TechCrunch: “Ultima says that its sequencing machine and software platform, the UG 100, can perform a […]”

ilyasoifer commented 2 years ago

Hi @jkbonfield, @jmarshall, sounds good, I would appreciate you making that change. Now, regarding your questions:

  1. The motivation behind the UG is obviously mostly to shorten the spelling, but maybe it is not that important. How about ULTIMAGEN or ULTIMA? I just feel that ULTIMAGENOMICS takes a lot of space. However, I leave it for you to decide.

  2. Here is the list of datasets that are making use of our data:

    • scRNAseq (published datasets)
      • SRR18145555
      • GSM6190599
    • Whole genome sequencing data from here in a Terra workspace, also available from the Ultima Genomics website. In addition, several other preprints will be published soon and will make their data publicly available.
    • Whole genome methylation

The instrument is expected to be commercially available next year, but already now there are several early access sites generating data.

I am happy to join the discussion on the 21st if you feel it will be useful. I am just worried that, since the bioinformatics team of Ultima is in Israel, it might be tricky to schedule, so I will leave it to your judgement.

Thanks a lot - let me know if you need more information! Ilya

jkbonfield commented 2 years ago

I think we have all the information we need to make a decision, but you're obviously welcome to join us if you wish (any time, not just for this issue).

I wouldn't worry about length too much as the header only appears once and it's insignificant in size. We'll discuss it at the meeting, but agreed that a full ULTIMAGENOMICS is unnecessarily unwieldy.

Thanks also for the data. I found it after the AGBT conference and have been playing with some already, with an eye to how best to compress it. Are you considering evaluating any lossy encoding of quality values? Maybe smoothing ("P-block"?), or binning/quantisation? It could also be done outside of homopolymers only if that's a key factor. Quality is a significant proportion of the data footprint, and so someone is likely to do research on this at some point. I feel it's best done by the people who understand the error models the most, i.e. UG, maybe in collaboration with a variant caller team. This is a different issue though, so if you wish to discuss it then it's maybe best to email jkb at sanger.ac.uk.
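For readers unfamiliar with the binning/quantisation idea mentioned above, here is a minimal Python sketch that maps each Phred quality value to one representative value per bin. The bin boundaries and representatives are made up for illustration and are not taken from Ultima Genomics, from any variant caller, or from any published scheme; `bin_quality` and `bin_quality_string` are hypothetical names, and the Phred+33 offset is the usual FASTQ/SAM text encoding.

```python
from bisect import bisect_right

# Hypothetical bins: each bin starts at BIN_STARTS[i] and is replaced by
# BIN_VALUES[i]. Chosen only to illustrate the technique.
BIN_STARTS = [0, 10, 20, 30, 40]
BIN_VALUES = [5, 15, 25, 35, 40]


def bin_quality(q):
    """Map one Phred score to the representative value of its bin."""
    if q < 0:
        raise ValueError("Phred scores must be non-negative")
    # bisect_right finds how many bin starts are <= q; the last such
    # start identifies the bin that q falls into.
    return BIN_VALUES[bisect_right(BIN_STARTS, q) - 1]


def bin_quality_string(qual, offset=33):
    """Apply bin_quality to every character of a Phred+33 quality string."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)


if __name__ == "__main__":
    print(bin_quality_string("IIII5555++++"))
```

Fewer distinct quality symbols generally compress better, at the cost of lost resolution; whether that loss affects variant calling on a given platform is exactly the kind of evaluation suggested in the comment.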

yfarjoun commented 2 years ago

I added it in PR #662. @ilyasoifer do you need any other tags? I think you are already using FO for flow order, but I just thought I'd ask to be sure.

ilyasoifer commented 2 years ago

Thanks! I am sorry that I missed the notification. @yfarjoun - this is all we need I think