missuse / ragp

Filter plant hydroxyproline rich glycoproteins
MIT License

scan_ag output #1

Closed TS404 closed 6 years ago

TS404 commented 6 years ago

The scan_ag and predict_hyp outputs are really nice.

It would also be good to have an output option with the same column names as "get_hmm" that simply lists the locations of the relevant prolines.

missuse commented 6 years ago

Thank you.

Could you elaborate a bit on this?

Do you mean something like a tidy data frame:

scan_ag - one row per P matched, or one row per regex match? For plotting, I think it would be best if scan_ag had an output with one regex match per row and the columns: id, start, end.

predict_hyp - a trimmed prediction element where only predicted hydroxyprolines are kept? Columns: id, location.

I could add an argument tidy = TRUE/FALSE to both functions to provide such an output.
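
To make the "one row per regex match" layout concrete, here is a minimal base R sketch. The sequences and the pattern are toy values made up for illustration; this is not the regex scan_ag uses, and the helper names are hypothetical.

```r
# Toy input: hypothetical sequences and pattern, NOT what scan_ag() uses
seqs    <- c(seq1 = "MAPAPSPTKK", seq2 = "MKKLLV")
pattern <- "P[AST]"

# gregexpr() returns, per sequence, the match starts plus a
# "match.length" attribute (or -1 when nothing matched)
matches <- gregexpr(pattern, seqs)

# Build one row per regex match with columns id, start, end
tidy_matches <- do.call(rbind, lapply(seq_along(matches), function(i) {
  m <- matches[[i]]
  if (m[1] == -1) {
    # Keep sequences without any match as a single NA row
    return(data.frame(id = names(seqs)[i],
                      start = NA_integer_,
                      end = NA_integer_,
                      stringsAsFactors = FALSE))
  }
  start <- as.integer(m)
  data.frame(id = names(seqs)[i],
             start = start,
             end = start + attr(m, "match.length") - 1L,
             stringsAsFactors = FALSE)
}))

print(tidy_matches)
```

Keeping an NA row for sequences without a match means every input id still appears in the result, which is convenient when the table is later joined to other per-sequence data.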

TS404 commented 6 years ago

Yes, that's the sort of thing. I think that tidy dataframes are the way to go. Something similar to this output:

  # Match locations from the list output
  agregions  <- scan_ag(sequence = sequences,
                        id = names(sequences),
                        dim = 3,
                        div = 6,
                        type = "extended", simplify = FALSE)$locations
  # Stack all (start, end) pairs into a two-column matrix
  agregions2 <- matrix(unlist(lapply(agregions, t)), ncol = 2, byrow = TRUE)
  # Number of matches per sequence
  agregcount <- unlist(lapply(agregions, nrow))

  # One row per match
  agregions3 <- data.frame(y           = rep(seq_along(sequences), agregcount),
                           id          = rep(names(sequences), agregcount),
                           align_start = agregions2[, 1],
                           align_end   = agregions2[, 2])

missuse commented 6 years ago

scan_ag(sequence = at_nsp$sequence[c(1, 3, 16, 23)],
        id = at_nsp$Transcript.id[c(1, 3, 16, 23)],
        simplify = FALSE,
        tidy = TRUE)[,-1] #to omit the sequence column from showing here

output:


           id location.start location.end        P_pos length AG_aa
1 ATCG00660.1             NA           NA           NA     NA    NA
2 AT2G28410.1             26           41   27, 35, 40     16     8
3 AT2G28410.1             55           70   55, 67, 69     16     6
4 AT2G43620.1             62           76 63, 70, ....     15     8
5 AT2G43620.1            167          185 168, 176....     19     8
6 AT2G30933.1             36           51 37, 42, ....     16     9
7 AT2G30933.1             63           78 64, 68, ....     16     8

P_pos is a list column holding the P positions within each regex match. length is the length of the matched substring. AG_aa is the number of amino acids that were identified in the matched substring.

No information is lost compared to the list output.
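
If a plain list of proline locations is all that is needed, the P_pos list column can be expanded to one row per position. The sketch below is only an illustration: it rebuilds by hand the three rows of the output above whose P_pos values are printed in full, rather than calling scan_ag, and the names tidy_ag and p_long are hypothetical.

```r
# Hand-built stand-in for part of the tidy output shown above
# (only rows whose P_pos values are printed in full)
tidy_ag <- data.frame(
  id             = c("ATCG00660.1", "AT2G28410.1", "AT2G28410.1"),
  location.start = c(NA, 26, 55),
  location.end   = c(NA, 41, 70),
  stringsAsFactors = FALSE
)
tidy_ag$P_pos <- list(NA, c(27, 35, 40), c(55, 67, 69))

# Expand the list column: repeat each row once per P position
n_pos  <- lengths(tidy_ag$P_pos)
p_long <- data.frame(
  id    = rep(tidy_ag$id, n_pos),
  start = rep(tidy_ag$location.start, n_pos),
  end   = rep(tidy_ag$location.end, n_pos),
  P_pos = unlist(tidy_ag$P_pos),
  stringsAsFactors = FALSE
)

# Drop sequences that had no match at all
p_long <- p_long[!is.na(p_long$P_pos), ]
rownames(p_long) <- NULL

print(p_long)
```

The result has one row per matched P, with the bounds of the enclosing regex match repeated alongside it.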

What do you think?

TS404 commented 6 years ago

Love it!

missuse commented 6 years ago

Added tidy output. It is currently available when simplify = FALSE and tidy = TRUE are set in the function call.