veitveit / PhosFake


Modifiable residues parameter #5

Open mlocardpaulet opened 1 month ago

mlocardpaulet commented 1 month ago

I am not sure what type the PTM-related parameters should be: single values? vectors? lists?

For ModifiableResidues and ModifiableResiduesDistr, I think these should be vectors, but I am not entirely sure. I would also suggest adding to the description that the two vectors must match. For example, if the first is c("S", "T", "Y"), should the second be c(0.86, 0.13, 0.01), with each value in the second corresponding to the residue at the same position in the first? (I suspect the same applies to UserInputFoldChanges_NumRegProteoforms and UserInputFoldChanges_RegulationFC.)

And then, I am not sure how to format multiple entries. I have tried this:

ModifiableResidues:
    group: "Ground Truth Data"
    type: "paramGroundTruth"
    description: "Residues that can be modified for each PTM type."
    min: NA
    max: NA
    default: c("S", "T", "Y")

but I don't think it is parsed correctly. Here is what I get when I run it:

> # Generate default parameters
> Param <- def_param('../data/input/parameters.yaml')
--------------------
Ground truth generation parameters:
 Param$paramGroundTruth = list 13 (2480 bytes)
.  NumCond = integer 1= 9
.  NumReps = integer 1= 3
.  PathToFasta = character 1= uniprotkb_homo_sapiens 
.  PathToProteinList = logical 1= NA
.  FracModProt = double 1= 0.5
.  FracModPerProt = double 1= 0.1
.  PTMTypes = character 1= ph 
.  PTMTypesDist = double 1= 0.5
.  PTMTypesMass = double 1= 79.966
.  PTMMultipleLambda = double 1= 0.1
.  ModifiableResidues = character 1= c("S", "T", "Y") 
.  ModifiableResiduesDistr = character 1= c(0.86, 0.13, 0.01) 
.  ...   and 1 more
--------------------
Proteoform abundance parameters:
 Param$paramProteoformAb = list 9 (1656 bytes)
.  QuantNoise = double 1= 0.3
.  DiffRegFrac = double 1= 0.1
.  DiffRegMax = integer 1= 10
.  UserInputFoldChanges_NumRegProteoforms = logical 1= NA
.  UserInputFoldChanges_RegulationFC = logical 1= NA
.  ThreshNAProteoform = integer 1= 0
.  AbsoluteQuanMean = integer 1= 7
.  AbsoluteQuanSD = double 1= 0.2
.  ThreshNAQuantileProt = integer 1= 0
--------------------
Digestion parameters:
 Param$paramDigest = list 9 (1664 bytes)
.  Enzyme = character 1= trypsin.strict 
.  PropMissedCleavages = double 1= 0.2
.  MaxNumMissedCleavages = integer 1= 3
.  PepMinLength = integer 1= 7
.  PepMaxLength = integer 1= 50
.  LeastAbundantLoss = integer 1= 0
.  EnrichmentLoss = double 1= 0.25
.  EnrichmentEfficiency = double 1= 0.35
.  EnrichmentNoise = double 1= 0.1
--------------------
MSRun parameters:
 Param$paramMSRun = list 7 (1200 bytes)
.  DetectabilityThreshold = double 1= 0.5
.  PercDetectedVal = integer 1= 1
.  WeightDetectVal = double 1= 0.1
.  MSNoise = double 1= 0.08
.  WrongIDs = integer 1= 0
.  WrongLocalizations = integer 1= 0
.  MaxNAPerPep = integer 1= 100
--------------------
Data analysis parameters:
 Param$paramDataAnalysis = list 3 (712 bytes)
.  ProtSummarization = character 1= medpolish 
.  MinUniquePep = integer 1= 100
.  StatPaired = logical 1= FALSE
--------------------
> # Run the simulations
> allBs <- run_sims(Param, phosfake_config)
Total number of simulations to run:  1 
#SAMPLE PREPARATION - Start

 + Importing data:
  - File /Users/locard/Documents/Projets_en_cours/2020_PhosFake/Analysis_for_paper/data/input/uniprotkb_homo_sapiens_AND_reviewed_tru_2024_08_12_canonical.fasta imported, containing 20435 protein sequences.
  - A total of 25 protein sequences have been removed due to unusual amino acids (B,J,O,U,X,Z) composition.
  - A total of 0 duplicated protein accessions have been removed.
  - Total number of remaining protein sequences: 20410 

 + Creating modified and unmodified fractions:
  - A total of 1 sequences are unmodifiable.
  - A total of 20409 sequences are modifiable, from which 50 % randomly selected to be modified.
  - Modified fraction: 10204 proteins.
  - Unmodified fraction: 10206 proteins.

 + Performing modification:
  - Selected modification type(s) "ph" with background frequency distribution of 50% respectively.
Error in parameters$ModifiableResiduesDistr[[i]] * 100 : 
  non-numeric argument to binary operator
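
The error above is consistent with ModifiableResiduesDistr having been read as a single character string (the parameter summary prints it as `character 1`), not as a numeric vector. A minimal base-R sketch of that situation follows; the variable names are hypothetical and the code does not call PhosFake itself:

```r
# What the parameter summary reports: one character string, not a numeric vector.
distr_as_string <- "c(0.86, 0.13, 0.01)"

# Multiplying a string by a number fails; capture the message instead of stopping.
msg <- tryCatch(
  distr_as_string * 100,
  error = function(e) conditionMessage(e)
)
print(msg)

# The same values held in a real numeric vector multiply element-wise.
distr_as_vector <- c(0.86, 0.13, 0.01)
print(distr_as_vector * 100)
```

In an English locale, `msg` holds the same "non-numeric argument to binary operator" text that appears in the traceback.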

mlocardpaulet commented 1 month ago

Here is what is in the Param list:

> Param
$paramGroundTruth
$paramGroundTruth$NumCond
[1] 9

$paramGroundTruth$NumReps
[1] 3

$paramGroundTruth$PathToFasta
[1] "uniprotkb_homo_sapiens_AND_reviewed_tru_2024_08_12_canonical.fasta"

$paramGroundTruth$PathToProteinList
[1] NA

$paramGroundTruth$FracModProt
[1] 0.5

$paramGroundTruth$FracModPerProt
[1] 0.1

$paramGroundTruth$PTMTypes
[1] "ph"

$paramGroundTruth$PTMTypesDist
[1] 0.5

$paramGroundTruth$PTMTypesMass
[1] 79.9663

$paramGroundTruth$PTMMultipleLambda
[1] 0.1

$paramGroundTruth$ModifiableResidues
[1] "c(\"S\", \"T\", \"Y\")"

$paramGroundTruth$ModifiableResiduesDistr
[1] "c(0.86, 0.13, 0.01)"

$paramGroundTruth$RemoveNonModFormFrac
[1] 0

mlocardpaulet commented 1 month ago

For the moment, I can manually change it afterwards (in the Param list).
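
A sketch of what that manual fix might look like. The `Param` list here is a stand-in containing only the two affected entries, not the output of `def_param`. The thread does not say whether PhosFake expects a plain vector or a list holding one vector per PTM type (the `[[i]]` in the error message hints at the latter), so the plain-vector form below is an assumption:

```r
# Stand-in for the two entries as they were parsed from the YAML file (strings).
Param <- list(
  paramGroundTruth = list(
    ModifiableResidues      = "c(\"S\", \"T\", \"Y\")",
    ModifiableResiduesDistr = "c(0.86, 0.13, 0.01)"
  )
)

# Overwrite the strings with real vectors (assumed shape: one plain vector each).
Param$paramGroundTruth$ModifiableResidues      <- c("S", "T", "Y")
Param$paramGroundTruth$ModifiableResiduesDistr <- c(0.86, 0.13, 0.01)

# Sanity checks: the two vectors must match in length, and the
# distribution should be numeric and sum to 1.
residues <- Param$paramGroundTruth$ModifiableResidues
distr    <- Param$paramGroundTruth$ModifiableResiduesDistr
stopifnot(length(residues) == length(distr))
stopifnot(is.numeric(distr), isTRUE(all.equal(sum(distr), 1)))
```

If one vector per PTM type turns out to be required, the equivalent assignment would wrap each vector in `list(...)`.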

veitveit commented 1 month ago

TODO: needs separate documentation for setting these parameters both in R and via YAML.
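
As a possible starting point for the YAML side of that documentation: in YAML, `c("S", "T", "Y")` is an ordinary plain scalar, so it is read as one string. A YAML sequence such as `["S", "T", "Y"]` is the usual way to write several values. The sketch below uses the R `yaml` package to show the difference. Whether `def_param` reads the file with this package, and whether it keeps sequences as vectors, is not stated in the thread, so treat this as an illustration, not as confirmed PhosFake behavior:

```r
library(yaml)  # assumes the 'yaml' package is installed

# Two ways of writing the default: R syntax (a plain string in YAML)
# versus a YAML flow sequence.
yaml_text <- '
AsRCall:
    default: c("S", "T", "Y")
AsSequence:
    default: ["S", "T", "Y"]
DistrAsSequence:
    default: [0.86, 0.13, 0.01]
'

parsed <- yaml.load(yaml_text)

# The R-style entry comes back as a single string...
str(parsed$AsRCall$default)

# ...while uniform sequences come back as vectors.
str(parsed$AsSequence$default)
str(parsed$DistrAsSequence$default)
```

The keys `AsRCall`, `AsSequence` and `DistrAsSequence` are hypothetical names chosen for the comparison; they are not PhosFake parameters.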