related-sciences / nxontology-ml

Machine learning to classify ontology nodes
Apache License 2.0

Add simple Catboost model #7

Closed (yonromai closed this 1 year ago)

yonromai commented 1 year ago

Adds a minimal CatBoost model that predicts the rs_classification label (from efo_otar_slim_v3.43.0_rs_classification.tsv).

Note: The code isn't intended to be especially useful yet. Rather, it serves as a first baseline and anchor for future feature and model development.


Example run:

Learning rate set to 0.091517
0:      learn: 1.2655947        total: 74.3ms   remaining: 1m 14s
100:    learn: 0.4192182        total: 1.04s    remaining: 9.3s
200:    learn: 0.3914375        total: 2.02s    remaining: 8.03s
300:    learn: 0.3724496        total: 3s       remaining: 6.98s
400:    learn: 0.3568311        total: 4.1s     remaining: 6.13s
500:    learn: 0.3437438        total: 5.1s     remaining: 5.08s
600:    learn: 0.3304904        total: 6.09s    remaining: 4.04s
700:    learn: 0.3194607        total: 7.26s    remaining: 3.1s
800:    learn: 0.3095234        total: 8.24s    remaining: 2.05s
900:    learn: 0.2992087        total: 9.22s    remaining: 1.01s
999:    learn: 0.2912932        total: 10.2s    remaining: 0us
> Feature importance:
                     Feature Id  Importances
0                        prefix    19.202524
1                       n_roots     8.639185
2                         depth     7.401090
3                   n_ancestors     7.091980
4          intrinsic_ic_sanchez     4.912018
5         xref__orphanet__count     3.976000
6             xref__omim__count     3.860123
7            xref__mondo__count     3.537002
8             xref__doid__count     3.456866
9                  intrinsic_ic     3.426787
10                is_gwas_trait     3.368879
11  intrinsic_ic_sanchez_scaled     3.051188
12            xref__mesh__count     3.047183
13            xref__ncit__count     2.794949
14                    n_parents     2.678225
15                   n_children     2.488526
16           xref__icd10__count     2.159789
17            xref__umls__count     2.104636
18                n_descendants     2.049742
19            xref__gard__count     2.004128
20        xref__snomedct__count     1.991120
21          xref__meddra__count     1.937532
22          intrinsic_ic_scaled     1.443108
23            xref__icd9__count     1.251675
24          xref__omimps__count     1.146293
25                     n_leaves     0.979452
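
To get a rough sense of which kinds of features drive the baseline, the importances above can be summed per feature family. The sketch below only re-uses the numbers printed in the table; the `family` helper and its grouping rules are hypothetical and not part of the PR.

```python
# Importances copied verbatim from the feature importance table above.
importances = {
    "prefix": 19.202524, "n_roots": 8.639185, "depth": 7.401090,
    "n_ancestors": 7.091980, "intrinsic_ic_sanchez": 4.912018,
    "xref__orphanet__count": 3.976000, "xref__omim__count": 3.860123,
    "xref__mondo__count": 3.537002, "xref__doid__count": 3.456866,
    "intrinsic_ic": 3.426787, "is_gwas_trait": 3.368879,
    "intrinsic_ic_sanchez_scaled": 3.051188, "xref__mesh__count": 3.047183,
    "xref__ncit__count": 2.794949, "n_parents": 2.678225,
    "n_children": 2.488526, "xref__icd10__count": 2.159789,
    "xref__umls__count": 2.104636, "n_descendants": 2.049742,
    "xref__gard__count": 2.004128, "xref__snomedct__count": 1.991120,
    "xref__meddra__count": 1.937532, "intrinsic_ic_scaled": 1.443108,
    "xref__icd9__count": 1.251675, "xref__omimps__count": 1.146293,
    "n_leaves": 0.979452,
}


def family(name: str) -> str:
    """Hypothetical grouping by feature-name pattern (an assumption, not from the PR)."""
    if name.startswith("xref__"):
        return "xref counts"
    if name.startswith("intrinsic_ic"):
        return "information content"
    if name.startswith("n_") or name == "depth":
        return "graph topology"
    return "other"  # prefix, is_gwas_trait


# Sum the importances within each family.
totals: dict[str, float] = {}
for name, value in importances.items():
    group = family(name)
    totals[group] = totals.get(group, 0.0) + value

total = sum(importances.values())

# Print families from most to least important, then the grand total.
for group, value in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{group:<20} {value:6.2f}")
print(f"{'total':<20} {total:6.2f}")
```

The listed importances sum to 100, so each family total can be read as a share of the overall importance. Note that `prefix` alone accounts for most of the "other" group.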

> Classification report:
                    precision    recall  f1-score   support

01-disease-subtype       0.80      0.85      0.82      1123
   02-disease-root       0.72      0.64      0.68       776
   03-disease-area       0.77      0.79      0.78       242
    04-non-disease       0.97      0.99      0.98       918

          accuracy                           0.83      3059
         macro avg       0.82      0.82      0.82      3059
      weighted avg       0.83      0.83      0.83      3059
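
As a sanity check on the summary rows, the macro and weighted averages can be recomputed from the per-class rows. This is a sketch that uses only the rounded values printed above, so the results match the report only up to rounding; the variable names are illustrative.

```python
# Per-class rows copied from the classification report above:
# label -> (precision, recall, f1-score, support)
report = {
    "01-disease-subtype": (0.80, 0.85, 0.82, 1123),
    "02-disease-root": (0.72, 0.64, 0.68, 776),
    "03-disease-area": (0.77, 0.79, 0.78, 242),
    "04-non-disease": (0.97, 0.99, 0.98, 918),
}

rows = list(report.values())
n_samples = sum(row[3] for row in rows)

# Macro average: unweighted mean over classes.
macro = [sum(row[i] for row in rows) / len(rows) for i in range(3)]

# Weighted average: mean over classes, weighted by class support.
weighted = [sum(row[i] * row[3] for row in rows) / n_samples for i in range(3)]

print(f"support total: {n_samples}")
for name, values in (("macro avg", macro), ("weighted avg", weighted)):
    precision, recall, f1 = values
    print(f"{name:<13} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The support-weighted recall is the same quantity as overall accuracy, which is why the two agree in the report. The gap between classes is more informative than either average: 02-disease-root is the weakest class, and 04-non-disease is nearly solved.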

yonromai commented 1 year ago

@dhimmel Updated the PR with the csv and StrEnum changes 🙏