nglviewer / ngl

WebGL protein viewer
http://nglviewer.org/ngl/
MIT License
657 stars 168 forks source link

1015 binary cif support #1040

Closed papillot closed 1 month ago

papillot commented 2 months ago

This PR adds support for Binary Cif files parsing and changes the RCSB data source provider to use this new format instead of the deprecated MMTF format.

Changes made

Fixes

Comments

Small benchmark, using the pdb 5z6y (relatively small strucutre GFP):

Format Provider Download size
mmtf RCSB 17.2kb*
bcif PDBe 182kb
bcif RCSB 33.4kb*

(*) RCSB response is gzipped

Despite the claim that bcif achieves better compression, it seems that there are still some caveats and generally speaking forcing the transition from bcif to mmtf creates regressions (also some improvements for specific use cases where the extra data content is relevant)

fredludlow commented 2 months ago

Thanks again - will have a proper look asap, just to note some of our examples already failing due to this, e.g. https://nglviewer.org/ngl/?script=showcase/viruses

EDIT: To clarify, failing because they can't grab the mmtf file, not because of changes in this PR!

fredludlow commented 2 months ago

Hmm, 3nap seems to be causing me some problems (might be a horror-show example as it's a virus) in the symmetry processing

image

papillot commented 2 months ago

I did not implement the alphaCarbondsOnly flag, to this might be it. I'll look at the issue with the mmtf file more precisely. The code was allowing to download backbone only structures. I haven't looked at wether the same exist for bcif files

papillot commented 2 months ago

Fixed:

image

(it seems the bug was already there in the previous cif parser)

fredludlow commented 1 month ago

Okay - looks good, I've gone through all the parser examples and they all work except for:

I can dig some more into these but thought I'd flag first as it might be something simple when you're familiar with the code.

fredludlow commented 1 month ago

Apologies, first one that wasn't working is parser/map (edited above, previously said ccp4, but adding this comment in case you're following by email too)

fredludlow commented 1 month ago

Oh, and you should make yourself the authore for pdbe-datasource.ts!

papillot commented 1 month ago

Good catch @fredludlow !

The issue with the 4UJD.cif.gz file was due to how compression is handled. The streamer returns an ArrayBuffer, which needs to be converted to a string to be processed by the CIF parsing library.

The second one is a bug with handling altlocs (they were not processed correctly in fact)

Both were pretty major issues. Maybe we should add more tests to better cover this code?

fredludlow commented 1 month ago

Can confirm both those are now working for me.

I've got a local PDB mirror and am running a script to try NGL.autoLoad on every mmCIF formatted entry - if this works there may still be other classes of bug, but it would definitely be reassuring.

Happy for you to merge this in the meantime (and thank you again!)

fredludlow commented 1 month ago

Hmm, 7a4p is causing issues

papillot commented 1 month ago

That's a tricky file: one of the chain (identifier U, entity id 20) is missing from the coordinates block. It is reported elsewhere as a 3 aa chain. So that's a missing null check I think.

fredludlow commented 1 month ago

7a4p was the ony one that threw an error / rejected the promise. There were approx 250 entries where the spacegroup was either undefined or another one that isn't recognized (P b c a, P 21 21 2 A, P 1 21/n 1, P n n a, C 4 21 2 and F 4 2 2 - putting here in case this comes up in the future) but I don't think that's related to the parser.

For reference, script is here: https://gist.github.com/fredludlow/e0a2a4af29d902350c872162315538d1

ppillot commented 1 month ago

Thanks @fredludlow that's so useful!

panda-byte commented 4 weeks ago

I'm not sure if the data source is set up properly for this. The following doesn't work for me (on a development server):

new Stage(...).loadFile('rscb://5z6y');

This tries to access http://models.rcsb.org/5z6y.bcif.gz, but that returns a 301 Moved Permanently, referring to the new location https://models.rcsb.org/5z6y.bcif.gz (using HTTPS!), which works, when used explicitly in the code. On the other hand, shouldn't the API described by RCSB be utilized instead? However, there seems to be no option for compression. The https://models.rcsb.org/5z6y.bcif.gz API seems to be the best option after all, even though it doesn't offer any options (I think that's just the download link on their website, right?).

papillot commented 4 weeks ago

I'm not sure if the data source is set up properly for this. The following doesn't work for me (on a development server):

new Stage(...).loadFile('rscb://5z6y');

This tries to access http://models.rcsb.org/5z6y.bcif.gz, but that returns a 301 Moved Permanently, referring to the new location https://models.rcsb.org/5z6y.bcif.gz (using HTTPS!), which works, when used explicitly in the code. On the other hand, shouldn't the API described by RCSB be utilized instead? However, there seems to be no option for compression. The https://models.rcsb.org/5z6y.bcif.gz API seems to be the best option after all, even though it doesn't offer any options (I think that's just the download link on their website, right?).

The protocol part (http:// vs https://) comes from the current location (i.e. the server that serves the current page, her your development server). We should make this always https then (I think it's already the case for PDBe). Regarding the compression, are you referring to compression headers? Not sure they are necessary as the file is already transmitted as a compressed stream?

panda-byte commented 4 weeks ago

Oh, I see. I think that would be good.

Regarding the compression: I was referring to the RCSB model API, which offers several endpoints to make requests to. I was wondering if instead of using links like https://models.rcsb.org/5z6y.bcif.gz directly, maybe this API should be queried instead (like the full endpoint), as it seems to be the "official", documented way. But it apparently doesn't offer any compression options, so maybe using the direct download link is still better.

papillot commented 4 weeks ago

So, I've just checked and the https://models.rcsb.org endpoint does a good job with sending the data stream compressed using the HTTP headers:

HTTP headers

Here is the download size compared with the resource size

Screenshot 2024-06-05 at 12 11 30

I'll make a fix for the https vs http

ppillot commented 1 week ago

Oh, I see. I think that would be good.

Regarding the compression: I was referring to the RCSB model API, which offers several endpoints to make requests to. I was wondering if instead of using links like https://models.rcsb.org/5z6y.bcif.gz directly, maybe this API should be queried instead (like the full endpoint), as it seems to be the "official", documented way. But it apparently doesn't offer any compression options, so maybe using the direct download link is still better.

@panda-byte #1043 has been merged and published as v2.3.1 with the http/https fix for rcsb