matsuken92 / molecular


Solution draft #13

Open matsuken92 opened 5 years ago

matsuken92 commented 5 years ago
First of all, we would like to thank everyone who participated in this competition, the host members, and the Kaggle Team! I'm happy to finish this competition with a really stable LB and no shake-down :D

■ Modeling part
Our team mainly used LightGBM with a wide range of features (around 500). We also adopted an NN model (an MPNN customized from @hengck23's MPNN model) to bring model diversity and strengthen the blend. We also found that seed averaging is very useful, so we averaged over a large number of seeds.

For details, please see the following slides.
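
As an illustration of the seed averaging mentioned above, here is a minimal sketch, not the team's actual code. `train_and_predict` and `toy_model` are hypothetical stand-ins for fitting a LightGBM or MPNN model with a given seed and returning its test predictions.

```python
import random
from statistics import mean

def seed_average(train_and_predict, seeds):
    """Average per-row predictions over models trained with different seeds."""
    all_preds = [train_and_predict(seed) for seed in seeds]
    # zip(*...) groups the predictions for each row across all seeds
    return [mean(row) for row in zip(*all_preds)]

def toy_model(seed, n_rows=4):
    # Hypothetical stand-in for a real fit: a fixed signal plus seed-dependent noise
    rng = random.Random(seed)
    return [float(i) + rng.gauss(0.0, 0.1) for i in range(n_rows)]

averaged = seed_average(toy_model, seeds=range(20))
print(averaged)
```

Averaging only reduces the part of the error that depends on the seed, so it complements blending different model types rather than replacing it.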

■ Features part
We struggled to find features that describe the global/local environment of atoms. We made a lot of features using chemical libraries (Openbabel, Dscribe, RDKit). These libraries were very helpful because we had little domain knowledge at the start.

Some of the effective features are listed below.

- distance
Some distances are essential features. We experimented with many distances, including those between
 - index0 and index1 atoms
 - index0/1 and atom in α,β,γ-substituents
 - index0/1 and atom of each type('H','C','N','O')
 - nearest neighbors of index0/1

 We also used features from this brilliant notebook (https://www.kaggle.com/criskiev/distance-is-all-you-need-lb-1-481) by @criskiev.
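
A few of the distance features listed above can be sketched with NumPy as follows. This is only an illustration under assumptions: the function name, the feature names, and the toy geometry are made up, and the substituent-based distances are left out because they need bond information.

```python
import numpy as np

def distance_features(coords, symbols, idx0, idx1):
    """Distance features for the atom pair (idx0, idx1) of one molecule."""
    coords = np.asarray(coords, dtype=float)
    # Full pairwise distance matrix, shape (n_atoms, n_atoms)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    feats = {"dist_0_1": float(dist[idx0, idx1])}
    for name, idx in (("idx0", idx0), ("idx1", idx1)):
        others = np.array([j for j in range(len(symbols)) if j != idx])
        d = dist[idx, others]
        feats[f"{name}_nearest"] = float(d.min())
        for elem in ("H", "C", "N", "O"):
            mask = np.array([symbols[j] == elem for j in others])
            # NaN when the molecule has no other atom of this element
            feats[f"{name}_min_dist_{elem}"] = (
                float(d[mask].min()) if mask.any() else float("nan")
            )
    return feats

# Made-up geometry, for illustration only
symbols = ["C", "H", "H", "O"]
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 2)]
feats = distance_features(coords, symbols, idx0=1, idx1=0)
print(feats)
```

Leaving missing element types as NaN works with LightGBM, which handles missing values natively.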

- Angle
Angle features also have very strong effects. We tried making a lot of hand-crafted features.
 - bond angle
 - plain angle
 - dihedral angle with index0/1 
 - statistics of dihedral angles formed by index0 and atoms three atoms away from index0
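
For reference, the bond angle and dihedral angle can be computed from xyz coordinates as in the sketch below. It uses the standard vector formulas and is not taken from the team's code; choosing which atoms to pass in (for example, the path between index0 and index1) is left to the caller.

```python
import numpy as np

def bond_angle(a, b, c):
    """Angle a-b-c in degrees, with b at the vertex."""
    v1 = np.asarray(a, float) - np.asarray(b, float)
    v2 = np.asarray(c, float) - np.asarray(b, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip to guard against rounding slightly outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def dihedral_angle(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 axis."""
    p0, p1, p2, p3 = (np.asarray(p, float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))

print(bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))
print(dihedral_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Because the dihedral is periodic, using its cosine (or both sine and cosine) as the feature may behave better in a tree model than the raw angle.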

- atom type
 We used atom type ('H','C','N','O') as features instead of electronegativity. By using the atom type not only of index0/1 but also of some substituents and neighbor atoms, we tried to describe the structures around the atoms involved in the coupling.

- partial charge
 Partial charge features played an important role. We mainly made them using `GetPartialCharges` in Openbabel. Some important features are below.
 - the partial charge values themselves of index0/1
 - the diff/ratio of the partial charges of index0/1
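
Given per-atom partial charges, the pair features above reduce to a few arithmetic operations. The sketch below assumes the charges are already available as a plain list (however they were computed); the function name, the feature names, and the `eps` guard are illustrative choices.

```python
def charge_pair_features(charges, idx0, idx1, eps=1e-8):
    """Partial charge features for the atom pair (idx0, idx1)."""
    q0, q1 = charges[idx0], charges[idx1]
    return {
        "charge_0": q0,
        "charge_1": q1,
        "charge_diff": q0 - q1,
        # The ratio is undefined for a (near-)zero denominator, so return NaN there
        "charge_ratio": q0 / q1 if abs(q1) > eps else float("nan"),
    }

# Made-up charges, for illustration only
charges = [0.1, -0.4, 0.0]
pair = charge_pair_features(charges, idx0=0, idx1=1)
print(pair)
```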

- characteristics of substituents
 According to this document (https://www.ucl.ac.uk/nmr/NMR_lecture_notes/L3_3_97_web.pdf), making features that describe the α,β,γ-substituents seems to be effective.

- the hybridization of atoms
 This feature is also from Openbabel.

- ACSF features
 This local descriptor works well. These features come from the chemical library Dscribe (https://singroup.github.io/dscribe/tutorials/acsf.html). We didn't tune g2,4_params because we didn't have enough time, so these features may have more room for improvement.
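
To give a feel for what an ACSF value measures, here is a hand-rolled radial (G2-type) symmetry function in NumPy. It does not use the Dscribe API and, unlike a full ACSF descriptor, is not resolved by element; `eta`, `r_s`, and `r_cut` are the usual names for the Gaussian width, the shift, and the cutoff radius, and the values below are arbitrary.

```python
import numpy as np

def cutoff(r, r_cut):
    """Cosine cutoff: decays smoothly from 1 at r=0 to 0 at r=r_cut."""
    return np.where(r < r_cut, 0.5 * (np.cos(np.pi * r / r_cut) + 1.0), 0.0)

def g2(coords, center, eta, r_s, r_cut):
    """Radial symmetry function of atom `center`, summed over all other atoms."""
    coords = np.asarray(coords, float)
    r = np.linalg.norm(coords - coords[center], axis=1)
    r = np.delete(r, center)  # drop the zero distance of the atom to itself
    return float(np.sum(np.exp(-eta * (r - r_s) ** 2) * cutoff(r, r_cut)))

# Made-up geometry: one neighbor at distance 1, one beyond the cutoff
coords = [(0, 0, 0), (1, 0, 0), (10, 0, 0)]
value = g2(coords, center=0, eta=1.0, r_s=1.0, r_cut=6.0)
print(value)
```

Each (`eta`, `r_s`) pair yields one feature per atom, so tuning these parameters changes which distance shells the descriptor is sensitive to.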

- Fingerprint
 To describe the characteristics of molecules, chemists seem to use molecular fingerprints. We made some features using the Morgan fingerprint (a very large number of bits) and the MACCS fingerprint (167 bits). They are global descriptors, and they helped a little.

- topological data analysis
Maximum and minimum radius of persistent homology, computed with [ripser](https://ripser.scikit-tda.org/).

- Bond ring feature
Circle size of the bond ring, computed with networkx.
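
One possible reading of this feature is the size of the smallest ring that contains a given bond. The sketch below computes that with a plain breadth-first search from the standard library instead of networkx, so it is a substitute for illustration, not the team's implementation; `bonds` is assumed to be a list of atom index pairs.

```python
from collections import deque

def smallest_ring_size(bonds, u, v):
    """Size of the smallest ring containing bond (u, v), or 0 if it is not in a ring."""
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # Shortest path from u to v that avoids the direct u-v bond
    dist = {u: 0}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if node == u and nxt == v:
                continue  # skip the bond itself
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                if nxt == v:
                    # Path edges plus the closing u-v bond equals the ring size
                    return dist[nxt] + 1
                queue.append(nxt)
    return 0

# Six-membered ring (atoms 0..5) with one substituent (atom 6) on atom 0
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
print(smallest_ring_size(bonds, 0, 1), smallest_ring_size(bonds, 0, 6))
```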

- PCA feature
Explained variance of a PCA computed on the xyz positions, aiming to represent the global shape of a molecule.
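
A minimal sketch of such a shape feature, assuming the explained variance ratio of the atom coordinates is what is meant: it uses an SVD in NumPy rather than any particular PCA library, and the function name is made up.

```python
import numpy as np

def pca_explained_variance_ratio(coords):
    """Share of positional variance along each principal axis of a molecule."""
    coords = np.asarray(coords, float)
    centered = coords - coords.mean(axis=0)
    # Squared singular values of the centered coordinates are proportional
    # to the variance along each principal axis
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    total = var.sum()
    return var / total if total > 0 else np.zeros_like(var)

# A linear arrangement puts all variance on one axis; a planar square splits it in two
linear = pca_explained_variance_ratio([(0, 0, 0), (1, 0, 0), (2, 0, 0)])
square = pca_explained_variance_ratio([(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)])
print(linear, square)
```

Ratios close to (1, 0, 0) indicate a rod-like molecule, two large values indicate a planar one, and three comparable values indicate a roughly globular one. Molecules with fewer than three atoms return fewer than three values.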
matsuken92 commented 5 years ago
Screenshot 2019-08-29 9 37 19
matsuken92 commented 5 years ago

solution_v001.pptx

matsuken92 commented 5 years ago

importance_fc_all.xlsx

matsuken92 commented 5 years ago

importance_v003_104_501.zip use_cols.csv.zip