too stringent outlier removal

kfuku52 commented 3 years ago

@Hego-CCTB I remember that I sent suggestions to you regarding this issue. Could you give me some updates?

Hego-CCTB commented 3 years ago

I haven't progressed on this yet. The original idea was to keep at least one of the samples in the case of one tissue being completely removed due to low within tissue correlation. The most challenging problem with this is: Which sample is the one being kept?

One Idea would be to keep the one that has the lowest between-tissue correlation, but this could lead to cases, where a complete outlier sample to be kept to represent a whole tissue.

So another approach would be to compare the samples cross-species and keep the ones closest to the same tissue in a different species.

Similar to this, one could look at orthologs of specific tissue marker genes, which would have to be defined on a case by case basis.

Or, instead of specific marker genes, one could look at the behaviour of genes from single-copy-orthogroups.

I think the last three options make the most sense. They are also the ones most complicated to integrate into curate. For this, another sub-utility, like cstmm is probably necessary.

kfuku52 commented 3 years ago

Could you tell me how the stepwise removal of each sample, rather than each BioProject, worked? To me, this is one of the easiest, yet potentially most effective solutions among what we've discussed so far.

Hego-CCTB commented 3 years ago

I've been investigating check_within_tissue_correlation() and I believe the function does not actually remove bioproject-wise, but step-wise by default. Let me explain what I think happens:

This block of the function is where the exclusion of a full bioproject should happen. This is right after identifying runs, which should be removed due to low tissue-correlation (those runs being stored in exclude_runs):

    if (length(exclude_runs)) {
    exclude_run_bps = sra2[(sra2$run %in% exclude_runs),c('bioproject','run','num_other_run_same_bp_tissue')]
    exclude_bp_counts = data.frame(table(exclude_run_bps$bioproject))
    exclude_run_bps = merge(exclude_run_bps, exclude_bp_counts, by.x='bioproject', by.y='Var1')
    exclude_run_bps = exclude_run_bps[order(exclude_run_bps$num_other_run_same_bp_tissue, 
    exclude_run_bps$Freq),]
    rownames(exclude_run_bps) = 1:nrow(exclude_run_bps)
    min_other_run_same_bp_tissue = exclude_run_bps[1,'num_other_run_same_bp_tissue']
    semimin_bp_count = exclude_run_bps[1,'Freq']
    cat('minimum number of other BioProjects within tissue:', min_other_run_same_bp_tissue, '\n')
    cat('semi-minimum count of exclusion-candidate BioProjects:', semimin_bp_count, '\n')
    conditions = (exclude_run_bps$Freq==semimin_bp_count)&. 
    (exclude_run_bps$num_other_run_same_bp_tissue==min_other_run_same_bp_tissue)
    exclude_bps = unique(exclude_run_bps[conditions,'bioproject'])
    exclude_runs = exclude_run_bps[(exclude_run_bps$bioproject %in% exclude_bps),'run']
    }

Afterwards, every run in exclude_runs will be printed and then removed from the dataset. No further changes to exclude_runs happen after the block above.

In particular, look at the first and last line of this block: exclude_run_bps = sra2[(sra2$run %in% exclude_runs),c('bioproject','run','num_other_run_same_bp_tissue')] ... exclude_runs = exclude_run_bps[(exclude_run_bps$bioproject %in% exclude_bps),'run']

the first line extracts bioproject info from the excluded runs and looks like this, for example:

bioproject	run	num_other_run_same_bp_tissue
PRJNA532463	SRR894712311111	1

In other words, exclude_run_bps only contains runs, which would be removed because of low tissue-correlation anyways. In the last line, we only take runs from that dataframe into exclude_runs. Because exclude_run_bps can't contain other runs belongig to this bioproject (except if other runs were excluded due to low tissue-correlation anyways), we can't remove bioproject-wise.

I made a small dataset, which has 7 runs over 3 tissues (4 flowers, 2 leafs, 1 root) and is guaranteed to reject one leaf sample to test this. The 2 leaves and the 1 root sample all belong to the same bioproject. 1 of the leaf samples is guaranteed to be removed (the expression data is just a copy of the root sample). After running curate, only one of the leaf samples is removed, even though both belong to the same bioproject.

I hope any of this makes sense. Am I looking at the correct place here? Am I interpreting the code correctly?

kfuku52 commented 3 years ago

Thank you for clarifying. I had a misunderstanding indeed. I thought all Runs are removed, but it's actually not. Sorry for the confusion.

Because exclude_run_bps can't contain other runs belongig to this bioproject (except if other runs were excluded due to low tissue-correlation anyways), we can't remove bioproject-wise.

But this seems to be not quite right, because it's removing multiple bad samples at once in a BioProject-wise manner here. exclude_runs = exclude_run_bps[(exclude_run_bps$bioproject %in% exclude_bps),'run']

Please consider the case where a BioProject contains two Runs: a terribly bad sample and a slightly bad sample. Both are categorized as exclude_runs in the first SVA. In the current implementation with the BioProject-wise removal, both removed and the curation is done. But if we change the behavior to the truly run-wise removal, only the terribly bad run is removed in the first SVA, leaving a chance for the slightly bad sample to survive in the slightly improved second-round SVA. This is what we expect here, at a cost of computational time.

Hego-CCTB commented 3 years ago

A "true" one outlier per same bioproject or same tissue option is now part of curate. See also: https://github.com/kfuku52/amalgkit/issues/49

kfuku52 / amalgkit

too stringent outlier removal #26