First of all, we would like to thank everyone who participated in this competition, the host members, and the Kaggle Team! I'm happy to finish this competition with a really stable LB and no shake-down :D
■ Modeling part
Our team mainly used LightGBM with a wide range of features (around 500). We also adopted an NN model (MPNN), customized from @hengck23's MPNN model, to bring model diversity and strengthen the blend. We also found seed averaging very useful, so we averaged over a large number of seeds.
For details, please see the following slide.
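As a rough illustration of seed averaging, here is a minimal sketch. It assumes each model can be wrapped in a function that trains with a given seed and returns test predictions; `fit_predict` and `toy_fit_predict` are hypothetical stand-ins, not our actual training code.

```python
import random
from statistics import fmean
from typing import Callable, List, Sequence


def seed_average(fit_predict: Callable[[int], Sequence[float]],
                 seeds: Sequence[int]) -> List[float]:
    """Average per-sample predictions over runs that differ only in their seed."""
    runs = [list(fit_predict(seed)) for seed in seeds]
    # zip(*runs) regroups the predictions sample by sample across seeds.
    return [fmean(per_sample) for per_sample in zip(*runs)]


def toy_fit_predict(seed: int) -> List[float]:
    # Hypothetical stand-in for "train LightGBM / MPNN with this seed, then predict".
    rng = random.Random(seed)
    truth = [1.0, 2.0, 3.0]
    return [t + rng.gauss(0.0, 0.1) for t in truth]


averaged = seed_average(toy_fit_predict, seeds=range(20))
```

Averaging reduces the seed-dependent noise in each prediction, which is why more seeds tend to help until the gain flattens out.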
■ Features part
We struggled to find features that describe the global/local environment of atoms. We made a lot of features using chemical libraries (Openbabel, Dscribe, RDKit). These libraries were very helpful because we had little domain knowledge at the start.
Some effective features are below.
- distance
Some distances are essential features. We experimented with many distances, including those between
- index0 and index1 atoms
- index0/1 and atom in α,β,γ-substituents
- index0/1 and atom of each type('H','C','N','O')
- nearest neighbors of index0/1
We also used features from this brilliant notebook (https://www.kaggle.com/criskiev/distance-is-all-you-need-lb-1-481) by @criskiev.
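A minimal sketch of the distance features listed above, using only the standard library. The toy molecule, the feature names, and the helper functions are made up for illustration; they are not our actual feature code.

```python
import math

# Hypothetical toy fragment: atom index -> (element, xyz coordinates in angstrom).
atoms = {
    0: ("C", (0.0, 0.0, 0.0)),
    1: ("H", (1.09, 0.0, 0.0)),
    2: ("H", (-0.36, 1.03, 0.0)),
    3: ("O", (0.0, 0.0, 1.43)),
}


def pair_distance(atoms, i, j):
    """Euclidean distance between atoms i and j."""
    return math.dist(atoms[i][1], atoms[j][1])


def nearest_by_type(atoms, center, skip=()):
    """Distance from `center` to the closest atom of each element type."""
    nearest = {}
    for idx, (element, _) in atoms.items():
        if idx == center or idx in skip:
            continue
        d = pair_distance(atoms, center, idx)
        nearest[element] = min(d, nearest.get(element, math.inf))
    return nearest


def distance_features(atoms, idx0, idx1):
    """Distance between the coupled pair plus nearest-atom distances per type."""
    feats = {"dist_idx0_idx1": pair_distance(atoms, idx0, idx1)}
    for name, center in (("idx0", idx0), ("idx1", idx1)):
        for element, d in nearest_by_type(atoms, center, skip=(idx0, idx1)).items():
            feats[f"{name}_nearest_{element}"] = d
    return feats


feats = distance_features(atoms, idx0=1, idx1=0)
```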
- Angle
Angle features also have very strong effects. We made a lot of hand-crafted features.
- bond angle
- plain angle
- dihedral angle with index0/1
- stats of dihedral angle with index0, 3-atoms-away from index0
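For reference, bond and dihedral angles can be computed from xyz coordinates roughly as follows. This is a generic sketch with the standard `atan2` form of the dihedral; the helper names are made up, and how the four atoms are chosen around index0/1 is left out.

```python
import math


def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))


def _dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def bond_angle(a, b, c):
    """Angle a-b-c in degrees, measured at the middle atom b."""
    u, v = _sub(a, b), _sub(c, b)
    cos_t = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def dihedral_angle(a, b, c, d):
    """Signed dihedral a-b-c-d in degrees (0 = cis, +/-180 = trans)."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))
```

The "stats" variants are then just aggregates (mean, min, max and so on) of these values over the candidate atoms.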
- atom type
We used atom type ('H','C','N','O') as a feature instead of electronegativity. By using the atom type not only of index0/1 but also of some substituents and neighbor atoms, we tried to describe the structure around the atoms involved in the coupling.
- partial charge
Partial charge features played an important role. We mainly made them using `GetPartialCharges` in Openbabel. Some important features are below.
- the partial charge value itself of index0/1
- the diff/ratio of the partial charges of index0/1
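A small sketch of the pair-level charge features, assuming the per-atom partial charges have already been computed elsewhere (for example with Openbabel). The function and feature names are hypothetical, and returning NaN for a near-zero denominator is just one possible choice.

```python
import math


def charge_pair_features(q0, q1, eps=1e-8):
    """Value, difference and ratio of the partial charges of index0/1."""
    # Avoid dividing by a (near-)zero charge; tree models can handle NaN.
    ratio = q0 / q1 if abs(q1) > eps else math.nan
    return {"q0": q0, "q1": q1, "q_diff": q0 - q1, "q_ratio": ratio}


feats = charge_pair_features(0.12, -0.36)
```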
- characteristics of substituents
According to this document (https://www.ucl.ac.uk/nmr/NMR_lecture_notes/L3_3_97_web.pdf), it seems effective to make features describing α,β,γ-substituents.
- the hybridization of atoms
This feature is also from Openbabel.
- ACSF features
This local descriptor works well. These features are derived from the chemical library Dscribe (https://singroup.github.io/dscribe/tutorials/acsf.html). We didn't tune g2,4_params because we didn't have enough time, so these features may have more room for improvement.
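To give an idea of what the G2 parameters control, here is a plain-Python sketch of the radial G2 symmetry function with a cosine cutoff, in the usual Behler-Parrinello form. It is only an illustration of the formula, not how Dscribe is called, and the parameter values in the example are arbitrary.

```python
import math


def cosine_cutoff(r, r_cut):
    """Smoothly decays from 1 at r = 0 to 0 at r = r_cut."""
    if r >= r_cut:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_cut) + 1.0)


def acsf_g2(center, neighbors, eta, r_s, r_cut):
    """G2 = sum_j exp(-eta * (r_ij - r_s)^2) * f_c(r_ij)."""
    total = 0.0
    for pos in neighbors:
        r = math.dist(center, pos)
        total += math.exp(-eta * (r - r_s) ** 2) * cosine_cutoff(r, r_cut)
    return total


# Arbitrary example: eta sets the width and r_s the position of the radial shell.
g2 = acsf_g2((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0), (0.0, 1.5, 0.0)],
             eta=1.0, r_s=1.0, r_cut=6.0)
```

Each (eta, r_s) pair yields one feature per atom, so tuning them changes which neighbor shells the descriptor is sensitive to.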
- Fingerprint
To describe the characteristics of molecules, chemists seem to use molecular fingerprints. We made some features using Morgan fingerprints (a very large number of bits) and MACCS fingerprints (167 bits). They are global descriptors, and they helped a little.
- topological data analysis
Maximum and minimum radius of persistent homology with [ripser](https://ripser.scikit-tda.org/).
- Bond ring feature
Size of the bond ring, computed with networkx.
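The idea can be sketched without networkx as a breadth-first search: the smallest ring containing a bond is the shortest path between its two atoms that avoids the bond itself, plus one. This is a standard-library stand-in for illustration, not our networkx code, and `smallest_ring_size` is a made-up name.

```python
from collections import deque


def smallest_ring_size(bonds, u, v):
    """Atoms in the smallest ring containing bond (u, v); 0 if it is not in a ring."""
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # BFS from u to v, never stepping directly across the u-v bond.
    dist = {u: 0}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if (node == u and nxt == v) or nxt in dist:
                continue
            dist[nxt] = dist[node] + 1
            if nxt == v:
                return dist[nxt] + 1  # path edges + the u-v bond = ring atoms
            queue.append(nxt)
    return 0


# Six-membered ring with one substituent (atom 6) attached to atom 0.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
ring_size = smallest_ring_size(bonds, 0, 1)
```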
- PCA feature
Explained variance of a PCA on the xyz positions, aiming to represent the global shape of a molecule.
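A minimal numpy sketch of this kind of shape feature. It assumes the explained variance *ratio* of the three principal axes is used; whether the ratio or the raw variance is the better feature is a judgment call, and the function name is hypothetical.

```python
import numpy as np


def pca_explained_variance_ratio(xyz):
    """Share of positional variance along each principal axis, largest first."""
    xyz = np.asarray(xyz, dtype=float)
    centered = xyz - xyz.mean(axis=0)
    cov = centered.T @ centered / len(xyz)
    # eigvalsh returns ascending eigenvalues of the symmetric 3x3 covariance.
    eigvals = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
    total = eigvals.sum()
    return eigvals / total if total > 0 else np.zeros(3)


# A linear molecule puts all variance on one axis, a planar one on two.
linear = pca_explained_variance_ratio([[0, 0, 0], [1, 0, 0], [2, 0, 0]])
planar = pca_explained_variance_ratio([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]])
```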