visze commented 2 years ago

Just checking whether there is an improvement. This will not be part of the manuscript.

visze commented 2 years ago

Well, the results are a bit better:

new positives:

metric  value
AUROC   0.995
AUPRC   0.605

[Figure: curves_prc_roc_global_mean_new_positives]

old positives:

metric  value
AUROC   0.996
AUPRC   0.585

[Figure: curves_prc_roc_global_mean]

But I have to rerun everything 100 times to see the average increase.
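A minimal sketch of how such repeated runs could be summarized into mean, max, and min per metric. The names `summarize_runs` and `dummy_run` are hypothetical: `dummy_run` is only a stand-in that returns a seeded random number, not a parSMURF call, and its values are not real results.

```python
import random
import statistics


def summarize_runs(run_once, seeds):
    """Call run_once(seed) for every seed and summarize the returned scores."""
    scores = [run_once(seed) for seed in seeds]
    return {
        "mean": statistics.fmean(scores),
        "max": max(scores),
        "min": min(scores),
    }


def dummy_run(seed):
    # Hypothetical placeholder for one training/evaluation run.
    # A real run would train the model with this seed and return e.g. the AUPRC.
    rng = random.Random(seed)
    return rng.uniform(0.0, 1.0)


# 100 repetitions with different seeds, as described in the comment above.
summary = summarize_runs(dummy_run, range(100))
print(summary)
```

Replacing `dummy_run` with a function that performs one actual run and returns the metric of interest would give the same kind of mean/max/min table as used later in this thread.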

visze commented 2 years ago

After rerunning parSMURF 100 times with different seeds, I get the following on hg38 with global means:

additional positives:

metric  mean    max min
AUROC   0.99501 0.996   0.995
AUPRC   0.59688 0.609   0.578

standard positives:

metric  mean    max min
AUROC   0.9959  0.996   0.995
AUPRC   0.58186 0.598   0.563

So we get a slight increase in mean AUPRC of 0.01502 and a small decrease in mean AUROC of 0.00089.
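The differences can be checked directly from the reported means. This is only an arithmetic sketch; the dictionary layout and the function name `mean_difference` are assumptions made for illustration, while the numbers are the means from the two tables above.

```python
# Mean values over 100 seeds, copied from the tables above (hg38, global means).
reported_means = {
    "additional": {"AUROC": 0.99501, "AUPRC": 0.59688},
    "standard": {"AUROC": 0.9959, "AUPRC": 0.58186},
}


def mean_difference(metric):
    """Return mean(additional positives) - mean(standard positives) for a metric."""
    diff = reported_means["additional"][metric] - reported_means["standard"][metric]
    # Round to the precision of the reported means to hide float noise.
    return round(diff, 5)


for metric in ("AUPRC", "AUROC"):
    print(f"{metric}: {mean_difference(metric):+.5f}")
```

A positive value means the additional positives score higher; a negative value means the standard positives score higher.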

I don't think it is worth including the new data, because the increase is small (and we are tuning on imbalance, so a better AUPRC is somewhat expected). In theory, a separate test set (not cross-validation) is needed to really show whether this helps.

Using global means for some features works much better than adding the new positives (see #12).

visze commented 2 years ago

Redo the same for the feature set of ReMM v1.4 and with both genome releases.

visze commented 2 years ago

Results

ReMM v1.4 hg38

For 100 repetitions with random seeds

Positives	Metric	Mean	Max	Min
standard	AUPRC	0.599	0.615	0.584
standard	AUROC	0.996	0.996	0.995
additional	AUPRC	0.609	0.62	0.595
additional	AUROC	0.995	0.996	0.995

ReMM v1.4 hg19

TODO

kircherlab / ReMM

Check the new positive data #6
