Add new RNAseq-based probe selection option

The issue

abagen currently offers a number of different methods for probe selection. Probe selection is required when there are several (>1) probes indexing the same gene; there exist several ways to either select a single "representative" probe, or collapse the data amongst the probes such that we obtain a single value, per brain region, for each gene (refer to the online documentation for more information).

One of the methods used—and recommended—by Arnatkevičiūte et al., 2019 is to select the probe that has the highest spatial correlation with RNAseq data for the corresponding gene. Unfortunately, abagen does not currently support this option (primarily since we don't [yet!] have a fetcher / loader for the RNAseq data; see #98).

Proposed solution

tl;dr Add a new probe_selection option: 'rnaseq'.

Selecting this option would fetch + load RNAseq data (for the two donors that have RNAseq data) and then, similar to the procedure for probe_selection='diff_stability', correlate the microarray expression for each probe across anatomical regions with the RNAseq expression for the corresponding gene. The microarray probe with the highest correlation to the RNAseq data would be selected for further consideration (above a certain threshold; below this threshold the probe would simply be removed).

Open question

Users are allowed to specify which donors they want to use via the donors option to abagen.get_expression_data(). Unfortunately, only two of the donors have RNAseq data, and it probably only makes sense to allow probe_selection='rnaseq' when at least one of those two donors are included in the call. This can be a pretty simple check at the beginning of the workflow that raises a ValueError if the required donors aren't provided.

Unfortunately, this leads to a somewhat more complicated issue: beyond fetching the relevant files at the beginning of the main workflow, we're not maintaining any information about the donors being used. This becomes necessary when we get to the actual probe selection stage, because we only want to be correlating microarray and RNAseq data from the same donors. That is, even if a user wants to generate microarray expression data with all six donors, we would only use the two donors with RNAseq data to perform the probe selection. Critically, this requires us to be able to match microarray data with the RNAseq data, but beyond attempting to parse filepaths this is not feasible in the current setup.

Potential fixes

Option one

Modify the dictionary that is returned by fetch_microarray() (and related functions) such that it is a dictionary of the form:

{
    "donorXXXX": {
        "microarray": "/path/to/MicroarrayExpression.csv",
        "annotation": "/path/to/Annotation.csv",
        "probes": "/path/to/Probes.csv",
        "ontology": "/path/to/Ontology.csv",
        "pacall": "/path/to/PACall.csv"
    }
}

Instead of the current method (where the dictionary keys are "microarray," "annotation," etc. and the values are lists of filepaths).

The benefit of this method is that we would be able to use the keys of the dictionary (the donor IDs) to match the microarray and RNAseq data in the probe selection step. The drawback to this method is that a significant portion of the current codebase relies on this list-of-filepaths format, so modifying the structure would require refactoring a lot of code.

Option two

We could correlate ALL the donors' microarray expression data with the RNAseq data available from the two donors, and then average across these correlations (similar to the method for probe_selection='diff_stability'), picking the probe with the highest average correlation.

The benefit of this method is that we wouldn't have to take into account which data belongs to which donor. The downside of this method is that this is almost certainly not as accurate as only correlating microarray and RNAseq data within the same donor.

Option three

\ Very open to suggestions and discussion about solutions to this issue of donor matching for RNAseq-based probe selection!

Despite the fact that we rarely require donor-specific information beyond, essentially, this explicit case, I think I'm going to move forward with option one.

This is going to change a LOT of the internals, including most of the dataset fetchers (e.g., datasets.fetch_microarray(), datasets.fetch_raw_mri(), and datasets.fetch_rnaseq()) to nearly all of the functions that previously accepted lists of e.g., microarray dataframes (e.g., probes_.collapse_probes, samples_.aggregate_samples()). I think it would be worthwhile to develop a little utility function for going from the "new" dictionary structure to a more flattened list-like structure:

def get_subattr(dictionary, attribute):
    """
    Parameters
    ----------
    dictionary : dict
        Input dictionary as returned by e,g., :func:`abagen.fetch_microarray()`
        with donor IDs as keys and dictionaries of filepaths as values
    attribute : str
        Which attributes to extract from the donor-level dictionaries

    Returns
    -------
    subattr : list of str
        Values of the specified `attribute` from each of the donor-level 
        dictionaries in `dictionary`
    """

    for donor, files in dictionary.items():
        yield donor[attribute]

Might be overkill, but I think the yield functionality will generally work since we (almost) always iterate over the list of filepaths. I'll re-visit as I work through the codebase and implement these changes...

rmarkello / abagen