Import reading functions from scp - see issue 199 for details.

lgatto commented 9 months ago

I've addressed all comments and changes. Any idea what will break once this is merged? What about vignettes in SCP.replication?

cvanderaa commented 9 months ago

After careful inspection of the new code I would like to rediscuss the some of the arguments (colAnnot, ecols, channelCol and batchCol). Currently, the documentation describes 2 cases for readQFeatures():

The single-set case

The user will use the following signature if they provide ecols (backward compatibility):

c("readQFeatures", c("data.frame", "missing"))

or better

c("readQFeatures", c("data.frame", "vector"))

Problem: users cannot provide a sample annotation table in the single-set case.

The multi-set case

The user will use the following signature

c("readQFeatures", c("data.frame", "data.frame"))

Problem: "lazy" users that don't have sample annotations are stuck. One argument could be that we promote good practice and enforce users to make a sample annotation table, but then, we must enforce it as well in the single-set case. Also, it sounds a bit confusing that colAnnot can refer to a table of sample annotations or a vector of column name/indices, which are different concepts to me.

My (bold) suggestion

colAnnot should always be a table (data.frame and friends), but the argument is optional.
Unify ecols (single-set case) and channelCol (multiset case) as a single argument (that I will tentatively call ecols2 for clarity).
- When ecols2 is a vector and colAnnot is missing: use it to select quantitative columns in assayData (same as ecols)
- When ecols2 is a vector and colAnnot is provided: use it to select quantitative columns in assayData (same as ecols)
- When ecols2 is a character(1) and colAnnot is provided: use it to select the quantitative column in assayData (same as ecols). If no match, use ecols2 to match a column in colAnnot that provide the names of the quantitative columns (same as channelCol)
- ecols2 can never be missing!
Use batchCol as follows:
- If batchCol is a vector (stopifnot(length(batchCol) == nrow(assayData))), use that vector to split the SE.
- If batchCol is a character(1) and colAnnot is provided, use it to retrieve the sample annotation column to use for splitting the SE.
- If batchCol is missing, don't split (=single-set case).

Conclusion

I think implementing this will enable more flexible use cases, without a loss of the current capabilities.

Single-set case: You can construct the QFeatures object with or without colData.
Multi-set case: You can read the data without the need for a colAnnot table. This could facilitate reading LFQ data where a channelCol makes no sense.

I will implement this, so to show my ideas in action, but I'm happy to already hear your thoughts on that.

cvanderaa commented 9 months ago

I've addressed all comments and changes. Any idea what will break once this is merged? What about vignettes in SCP.replication?

Since the behavior of the functionality did not change and there is backward compatibility, I don't think anything will break as long as we adapt readSCP() as a wrapper function as you suggested in #199

lgatto commented 9 months ago

Since the behavior of the functionality did not change and there is backward compatibility, I don't think anything will break as long as we adapt readSCP() as a wrapper function as you suggested in #199.

But some arguments have changed (for now at least), such as batchCol to runCol, which will lead to errors.

lgatto commented 9 months ago

There are some aspects I like, others that I'm unsure to understand. But here are a first set fo comments, and I'll think more about your suggestion.

Problem: users cannot provide a sample annotation table in the single-set case.

Yes, they can, by setting a colAnnotation that defines a single run/batch. See the 'Single-set case using a data.frame' example.

Also, it sounds a bit confusing that colAnnot can refer to a table of sample annotations or a vector of column name/indices, which are different concepts to me.

These are two different ways to indicate annotate columns, so not entirely different, but still confusing I agree. My thinking was to steer away from ecol and promote its replacement by a colAnnotation table.

Your suggestion is the opposite, i.e. to always use ecol, and only use colAnnotation table to add a colData to the final QFeatures.

In your case 2,

When ecols2 is a vector and colAnnot is provided: use it to select quantitative columns in assayData (same as ecols)

I assume here colAnnot is used to populate the colData of the returned QFeatures object?

lgatto commented 9 months ago

And what about the following, that simplifies you suggestion (I think). Here the user uses either a colAnnotation data.frame, or one or two arguments.

colAnnotation is missing

The user must always provide the ecol vector with variables names or indices or a logical to identify what columns are quantitative data. This defines de vertical splitting of the assay and the row data.
If only ecol of provided, and the second vector runCol is missing (or NULL), we are in the single-set scenario.
If runCol is also provided, it must be a numeric(1) or a character(1) that gives the column name that defines the horizontal splitting. The presence of runCol defines the multi-set case.

colAnnotation is present

When colAnnotation is present, ecol and runCol must either be missing (or both NULL). colAnnotation must be a data.frame or an object that can be coerced to a data.frame.
colAnnotation must at least contain a column names ecol that with the names of the samples, i.e. table column names used for vertical splitting of the data. This is equivalent of using the ecol argument, except it is passed in a data.frame.
colAnnotation can also contain a runCol column that defines the horizontal splitting of the input table.
If colAnnotation only contains ecol, we are in the single-set case, and nrow(colAnnotation) corresponds to the number of samples.
If colAnnotation contains ecol and runCol, we are in the multi-set case, and nrow(colAnnotation) == nrow(table). This situation requires a bit more work from the user, but that work would anyway have to be done to provide additional sample annotations (see next point). In addition, this is also something a GUI from scpGUI could help with.
Any other columns of colAnnotation will become colData variables.

Notes:

Here's a blog post that advises to rely on NULL instead of missing. Will try to dig it up.
And at this point, we could rename ecol to sampleCol, to match it with runCol.

cvanderaa commented 9 months ago

I really like your suggestion and how to present the problem as vertical and horizontal splitting.

I'm however skeptic with the following:

If colAnnotation contains ecol and runCol, we are in the multi-set case, and nrow(colAnnotation) == nrow(table). This situation requires a bit more work from the user, but that work would anyway have to be done to provide additional sample annotations (see next point). In addition, this is also something a GUI from scpGUI could help with.

For SCP data for instance, nrow(table) can exceed 1,000,000 lines meaning that the colAnnot lines will need to be duplicated throusands of times.

Instead, I would suggest to keep nrow(colAnnot) equal to the number of samples in both cases (ie with or without a runCol column). When colAnnot contains a runCol column, use that column to retrieve the run identifiers and (thanks to an efficient internal function), match the column in assayData that contains the same run identifiers. This will require a bit more functionality to perform the matching of columns, but the task should be trivial.

What do you think?

cvanderaa commented 9 months ago

And at this point, we could rename ecol to sampleCol, to match it with runCol.

Yes I prefer sampleCol, but what about quantCol? Again, no strong opinion on this, so I let you decide.

lgatto commented 9 months ago

I was wrong, nrow(colAnnotation) would need to be equal to number of samples time number of runs. Is that what you meant? In other words, in that case, colAnnotation would be the final colData.

lgatto commented 9 months ago

Yes I prefer sampleCol, but what about quantCol? Again, no strong opinion on this, so I let you decide.

Whichever we choose, it should be plural.

lgatto commented 9 months ago

Single-set case, multiplexed: requires quantCols only. Also LF with a re-ordered peptide/protein-level table (runs are missing in this case).

|------+------------+-----------|
| cols | Quant 1..N | more cols |
|      |            |           |
|      |            |           |
|      |            |           |
|------+------------+-----------|

Multi-set case, multiplexed: requires quantCols and runCol.

|-----+------+------------+-----------|
| Run | cols | Quant 1..N | more cols |
|   1 |      |            |           |
|   1 |      |            |           |
|-----+------+------------+-----------|
|   2 |      |            |           |
|-----+------+------------+-----------|

Multi-set case, LF: requires quantCols and runCol with a optional multiPlexCol (for plexDIA).

|-----+------+---------+-----------+-------|
| Run | cols | Quant 1 | more cols | multi |
|   1 |      |         |           |       |
|   1 |      |         |           |       |
|-----+------+---------+-----------+-------|
|   2 |      |         |           |       |
|-----+------+---------+-----------+-------|

Special case DIANN. A specialised function that parses the table to case 2.

Users can either use the arguments above or a colAnnotation data.frame (that will become the colData).

cvanderaa commented 9 months ago

I'll work tonight on refactoring readQFeatures() to accommodate for the new arguments and behaviour. I will provide the changes in a new PR.

rformassspectrometry / QFeatures