ideaconsult / jna-inchi

Wrapper to access InChI from Java
GNU Lesser General Public License v2.1
0 stars 0 forks source link

assess effort for adding support for MDL CTAB V3000 #15

Closed uli-f closed 1 year ago

uli-f commented 1 year ago

Some of the reaction files we process are in V3000 format.

@ntk73 Could you give us a rough estimate how much work / cost would be involved to add support for V3000? Thank you.

ntk73 commented 1 year ago

Uli, some clarification is needed. What exactly do you mean by "adding support for MDL CTAB V3000"?

Some additional thoughts from my side: When converting RInChI to reaction, RInChI library returns reactions in V2000 format. So far, I have not tested yet an input to native code in V3000 format and I suspect that V3000 is not supported (at least nothing about v3000 is mentioned in the official documentation). Could you provide an example RXN/RDFile file in V3000 format?

Also, if we have a chemical reaction object either in CDK library data model or in jna-rinchi data model, then V3000 is not needed to generate RInChI linear notation.

Also, if it appears after testing that RInChI native code can read MDL RXN V3000, then FileTextUtils class could be used directly to generate a RInChI string from such a file. The only reasonable use case for "V3000 support", I can imagine currently, is the need to be able to read a reaction in V3000 format, represent it internally as a jna-rinchi data object and then send it to native code in V2000 format.

By the way, currently CDK could read a structure in V3000 and V2000, but RXN (reaction) is supported in V2000 only

uli-f commented 1 year ago

Thank you for your insight. I'd try to clarify in the following.

When converting RInChI to reaction, RInChI library returns reactions in V2000 format. So far, I have not tested yet an input to native code in V3000 format and I suspect that V3000 is not supported (at least nothing about v3000 is mentioned in the official documentation).

Unfortunately, that is correct! I saw MOL V3000 files being part of the test set in the RInChI repo. So I made the assumption that RXN V3000 is also supported - well, that is not the case. I raised an issue with them to get an idea if that might be on their agenda (see IUPAC-InChI/RInChI#15).

Could you provide an example RXN/RDFile file in V3000 format?

Yep, here you go: v3000-marvin.zip

Also, if we have a chemical reaction object either in CDK library data model or in jna-rinchi data model, then V3000 is not needed to generate RInChI linear notation.

Agreed. My motivation comes from using jna-rinchi directly for batch-processing a huge amount of reactions in RXN V3000. But yes, the same can be achieved if CDK is able to read reactions in RXN V3000 and then kicking off the RInChI calculation by using API calls in CDK to calculate RInChI.

Also, if it appears after testing that RInChI native code can read MDL RXN V3000, then FileTextUtils class could be used directly to generate a RInChI string from such a file.

Unfortunately, RInChI native code cannot read MDL RXN V3000 at the moment. But it might be able in the future 🤞🏼

The only reasonable use case for "V3000 support", I can imagine currently, is the need to be able to read a reaction in V3000 format, represent it internally as a jna-rinchi data object and then send it to native code in V2000 format.

Agreed. But then it seems way easier to just harness the V3000 reading capabilities of CDK instead of re-inventing that wheel again...

By the way, currently CDK could read a structure in V3000 and V2000, but RXN (reaction) is supported in V2000 only

To the best of my knowledge CDK has support for reading RXN V3000: https://github.com/cdk/cdk/tree/main/storage/ctab/src/main/java/org/openscience/cdk/io If it comes to writing it only has support for RXN V2000, and I am pretty sure there is no support for RDF.

uli-f commented 1 year ago

To summarize:

There is one action item left to do: @ntk73 I added some javadoc and inline comments to make it very clear that at the moment only RXN V3000 and RD V3000 are supported. Would you please merge PR #18? Thank you.

I am happy for you to close this issue once PR #18 is merged.

ntk73 commented 1 year ago

I have merged PR #18. I will do small corrections on top of it: You use term "RDF" file format but RDF is not the same as RDFile. RDF file format is triple store for data serialization (... we use it often in our projects)

ntk73 commented 1 year ago

Corrections done! See commit c01a8f431f4c488227758649b868263a6215db69

uli-f commented 1 year ago

Thanks, https://github.com/ideaconsult/jna-inchi/commit/c01a8f431f4c488227758649b868263a6215db69 looks good.

I am closing this now.