Closed sly1983 closed 1 year ago
Hi!
What version of phylotaR
are you using?
Did you install from CRAN, like this: install.packages("phylotaR")
or from GitHub like this devtools::install_packages("ropensci/phylotaR")
?
Updating to the latest version on GitHub might fix the issue for you. Try reinstalling the package, restarting R and running the code again.
Otherwise, given that you have such a large taxonomic group, 4747.
You could try to break down the clusters to a smaller subset to determine on which cluster or sequences the issue is arising.
You could drop a random subset of clusters, the largest clusters or the smallest clusters using the drop_clsters()
function.
Or you could break the clusters down by taxonomic group... e.g. to break the clusters down to an individual genus you could:
library(phylotaR)
data("bromeliads")
# fetch the genus id for every sequence in the cluster
genus_ids <- get_txids(phylota = bromeliads, sid = bromeliads@sids,
rnk = 'genus')
# in genus_ids, each element refers to an NCBI taxon ID for a genus
# names(genus_ids), will show the labelled name attached to each element
# these names refer to the sequence ID in the cluster
# identify the unique genera
unique_genus_ids <- unique(genus_ids)
# drop empty IDs
unique_genus_ids <- unique_genus_ids[unique_genus_ids != '']
# convert genus IDs to names (if needed)
scientific_names <- get_tx_slot(phylota = bromeliads, txid = unique_genus_ids,
slt_nm = 'scnm')
# choose a genus...
# Option 1: .... by name
bool <- "Ananas" == scientific_names
annas_id <- names(scientific_names)[bool]
bool <- genus_ids == annas_id
seq_ids_to_keep <- names(genus_ids)[bool]
# Option 2: .... by number of sequences
freq_table <- sort(table(genus_ids), decreasing = TRUE)
bool <- genus_ids == names(freq_table)[1]
seq_ids_to_keep <- names(genus_ids)[bool]
# shrink down clusters to just the sequences of the specific genus
genus_clusters <- bromeliads
for (cid in genus_clusters@cids) {
genus_clusters <- drop_sqs(phylota = genus_clusters, cid = cid,
sid = seq_ids_to_keep)
}
# drop all clusters with 0 seqs left
n_taxa <- get_ntaxa(phylota = genus_clusters, cid = genus_clusters@cids)
genus_clusters <- drop_clstrs(phylota = genus_clusters,
cid = genus_clusters@cids[n_taxa > 0])
# ^ try printing the genus_clusters to test the changes
Hi,
thanks a lot for the reply. Unfortunately after updating the version the issue is still there. Yes, the taxonomic group is huge. The second suggestion is working to check individual genera. But since there are hundreds of genera it is not really feasible to go through them one by one. Would there be a way to check this in a more automated fashion?
Thanks again for the help!
Hi,
I'm also getting the same error using drop_by_rank(). I've tried both proposals above without success.
I have no clue what may be causing this. Thank you so much for your assistance.
Hi,
I'm experiencing the very same error. I've installed the latest version from Github via remotes::install_github('ropensci/phylotaR')
. Is there any solution to overcome this error, apart from the options provided above?
Cheers Bastian
@bheimbu I cannot reproduce the error with GitHub version and txid <- 9504
. txid <- 4747
takes too much time to download, and it always is interrupted in my environment. What's the id are you working on? If you succeed in the run()
, could you share the zipped directory?
ncbi_dr = "/workspaces/blast/bin/"
devtools::load_all()
wd = "/workspaces/test2"
#unlink(wd)
# library(phylotaR)
# wd <- '/media/...
# ncbi_dr <- '/home/...
# txid <- 4747
txid <- 9504
setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v=TRUE, overwrite = TRUE)
run(wd = wd)
all_clusters <- read_phylota(wd)
print(all_clusters)
cids <- all_clusters@cids
n_taxa <- get_ntaxa(phylota = all_clusters, cid = cids)
keep <- cids[n_taxa >= 5]
selected <- drop_clstrs(phylota = all_clusters, cid = keep)
reduced <- drop_by_rank(selected, rnk = 'species', n=1)
My log:
r$> # Test example from issue https://github.com/ropensci/phylotaR/issues/48
ncbi_dr = "/workspaces/blast/bin/"
r$> devtools::load_all()
ℹ Loading phylotaR
^[[200~wd = "/workspaces/test2"^[[201~
r$> wd = "/workspaces/test2"
r$> #unlink(wd)
# library(phylotaR)
# wd <- '/media/...
# ncbi_dr <- '/home/...
# txid <- 4747
txid <- 9504
r$> setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v=TRUE)
-------------------------------------------------
phylotaR: Implementation of PhyLoTa in R [v1.3.0]
-------------------------------------------------
Checking for valid NCBI BLAST+ Tools ...
Found: [/workspaces/blast/bin/makeblastdb]
Found: [/workspaces/blast/bin/blastn]
. . Running makeblastdb
Setting up pipeline with the following parameters:
. blstn [/workspaces/blast/bin/blastn]
. btchsz [100]
. date [2023-01-03]
. db_only [FALSE]
. mdlthrs [3000]
. mkblstdb [/workspaces/blast/bin/makeblastdb]
. mncvrg [51]
. mnsql [250]
. multiple_ids [FALSE]
. mxevl [1e-10]
. mxnds [1e+05]
. mxrtry [100]
. mxsql [2000]
. mxsqs [50000]
. ncps [1]
. outfmt [6 qseqid sseqid pident length evalue qcovs qcovhsp]
. outsider [FALSE]
. srch_trm [NOT predicted[TI] NOT "whole genome shotgun"[TI] NOT unverified[TI] NOT "synthetic construct"[Organism] NOT refseq[filter] NOT TSA[Keyword]]
. txid [9504]
. v [TRUE]
. wd [/workspaces/test2]
. wt_tms [1, 3, 6 ...]
Error in cache_setup(ps = ps, ovrwrt = overwrite) :
Cache already exists, ovrwrt = FALSE.
r$> setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v=TRUE, overwrite = TRUE)
-------------------------------------------------
phylotaR: Implementation of PhyLoTa in R [v1.3.0]
-------------------------------------------------
Checking for valid NCBI BLAST+ Tools ...
Found: [/workspaces/blast/bin/makeblastdb]
Found: [/workspaces/blast/bin/blastn]
. . Running makeblastdb
Setting up pipeline with the following parameters:
. blstn [/workspaces/blast/bin/blastn]
. btchsz [100]
. date [2023-01-03]
. db_only [FALSE]
. mdlthrs [3000]
. mkblstdb [/workspaces/blast/bin/makeblastdb]
. mncvrg [51]
. mnsql [250]
. multiple_ids [FALSE]
. mxevl [1e-10]
. mxnds [1e+05]
. mxrtry [100]
. mxsql [2000]
. mxsqs [50000]
. ncps [1]
. outfmt [6 qseqid sseqid pident length evalue qcovs qcovhsp]
. outsider [FALSE]
. srch_trm [NOT predicted[TI] NOT "whole genome shotgun"[TI] NOT unverified[TI] NOT "synthetic construct"[Organism] NOT refseq[filter] NOT TSA[Keyword]]
. txid [9504]
. v [TRUE]
. wd [/workspaces/test2]
. wt_tms [1, 3, 6 ...]
-------------------------------------------------
r$> run(wd = wd)
---------------------------------------------------
Running pipeline on [unix] at [2023-01-03 05:09:10]
---------------------------------------------------
Running stages: taxise, download, cluster, cluster2
--------------------------------------------
Starting stage TAXISE: [2023-01-03 05:09:10]
--------------------------------------------
Searching taxonomic IDs ...
Downloading taxonomic records ...
. [1-22]
Generating taxonomic dictionary ...
---------------------------------------------
Completed stage TAXISE: [2023-01-03 05:09:16]
---------------------------------------------
----------------------------------------------
Starting stage DOWNLOAD: [2023-01-03 05:09:16]
----------------------------------------------
Identifying suitable clades ...
Identified [1] suitable clades.
Downloading hierarchically ...
Working on parent [id 9504]: [1/1] ...
. + direct ...
. + by child ...
. . Working on child [id 2688256]
. . + whole subtree ...
. . . Downloading [34 sqs] ...
. . . . [1-34]
. . . Working on child [id 1263727]
. . . + whole subtree ...
. . . . Downloading [3 sqs] ...
. . . . . [1-3]
. . . . Working on child [id 361674]
. . . . + whole subtree ...
. . . . . Downloading [1 sqs] ...
. . . . . . [1-1]
. . . . . Working on child [id 292213]
. . . . . + whole subtree ...
. . . . . . Downloading [100 sqs] ...
. . . . . . . [1-100]
. . . . . . Working on child [id 57176]
. . . . . . + whole subtree ...
. . . . . . . Downloading [200 sqs] ...
. . . . . . . . [1-100]
. . . . . . . . [101-200]
. . . . . . . Working on child [id 57175]
. . . . . . . + whole subtree ...
. . . . . . . . Downloading [100 sqs] ...
. . . . . . . . . [1-100]
. . . . . . . . Working on child [id 43147]
. . . . . . . . + whole subtree ...
. . . . . . . . . Downloading [900 sqs] ...
. . . . . . . . . . [1-100]
. . . . . . . . . . [101-200]
. . . . . . . . . . [201-300]
. . . . . . . . . . [301-400]
. . . . . . . . . . [401-500]
. . . . . . . . . . [501-600]
. . . . . . . . . . [601-700]
. . . . . . . . . . [701-800]
. . . . . . . . . . [801-900]
. . . . . . . . . Working on child [id 37293]
. . . . . . . . . + whole subtree ...
. . . . . . . . . . Downloading [600 sqs] ...
. . . . . . . . . . . [1-100]
. . . . . . . . . . . [101-200]
. . . . . . . . . . . [201-300]
. . . . . . . . . . . [301-400]
. . . . . . . . . . . [401-500]
. . . . . . . . . . . [501-600]
. . . . . . . . . . Working on child [id 30591]
. . . . . . . . . . + whole subtree ...
. . . . . . . . . . . Downloading [1000 sqs] ...
. . . . . . . . . . . . [1-100]
. . . . . . . . . . . . [101-200]
. . . . . . . . . . . . [201-300]
. . . . . . . . . . . . [301-400]
. . . . . . . . . . . . [401-500]
. . . . . . . . . . . . [501-600]
. . . . . . . . . . . . [601-700]
. . . . . . . . . . . . [701-800]
. . . . . . . . . . . . [801-900]
. . . . . . . . . . . . [901-1000]
. . . . . . . . . . . Working on child [id 9505]
. . . . . . . . . . . + whole subtree ...
. . . . . . . . . . . . Downloading [300 sqs] ...
. . . . . . . . . . . . . [1-100]
. . . . . . . . . . . . . [101-200]
. . . . . . . . . . . . . [201-300]
Successfully retrieved [3238 sqs] in total.
-----------------------------------------------
Completed stage DOWNLOAD: [2023-01-03 05:12:04]
-----------------------------------------------
---------------------------------------------
Starting stage CLUSTER: [2023-01-03 05:12:04]
---------------------------------------------
Working from [id 9504] down hierarchy
. + direct [id 9504 (genus)]
. + subtree [id 9504 (genus)]
. . BLASTing [3238 sqs] ....
. . Running makeblastdb
. . Running blastn
. . Removed [86407/181398] hits due to insufficient coverage
. . Identified [794] clusters
. + parent [id 9504]
. . + direct [id 2688256 (no rank)]
. . + subtree [id 2688256 (no rank)]
. . . BLASTing [34 sqs] ....
. . . Removed [0/417] hits due to insufficient coverage
. . . Identified [15] clusters
. . + parent [id 2688256]
. . . + subtree [id 1230482 (species)]
. . + parent [id 2688256]
. . . + subtree [id 1090913 (species)]
. . + parent [id 2688256]
. . . + subtree [id 1002694 (species)]
. . + parent [id 2688256]
. . . + subtree [id 940829 (species)]
. . + parent [id 2688256]
. . . + subtree [id 413234 (species)]
. . + parent [id 2688256]
. . . + subtree [id 261316 (species)]
. . + parent [id 2688256]
. . . + subtree [id 231953 (species)]
. . + parent [id 2688256]
. . . + subtree [id 222417 (species)]
. + parent [id 9504]
. . + subtree [id 1263727 (species)]
. + parent [id 9504]
. . + subtree [id 361674 (species)]
. + parent [id 9504]
. . + subtree [id 292213 (species)]
. + parent [id 9504]
. . + subtree [id 57176 (species)]
. + parent [id 9504]
. . + subtree [id 57175 (species)]
. + parent [id 9504]
. . + subtree [id 43147 (species)]
. + parent [id 9504]
. . + subtree [id 37293 (species)]
. + parent [id 9504]
. . + direct [id 30591 (species)]
. . . BLASTing [565 sqs] ....
. . . Removed [36203/51582] hits due to insufficient coverage
. . . Identified [264] clusters
. . + subtree [id 30591 (species)]
. . . BLASTing [435 sqs] ....
. . . Removed [911/7414] hits due to insufficient coverage
. . . Identified [71] clusters
. . + parent [id 30591]
. . . + subtree [id 867331 (subspecies)]
. . + parent [id 30591]
. . . + subtree [id 280755 (subspecies)]
. . + parent [id 30591]
. . . + subtree [id 120088 (subspecies)]
. + parent [id 9504]
. . + subtree [id 9505 (species)]
[1/1]
----------------------------------------------
Completed stage CLUSTER: [2023-01-03 05:12:23]
----------------------------------------------
-----------------------------------------------
Starting stage CLUSTER^2: [2023-01-03 05:12:23]
-----------------------------------------------
Loading clusters ...
Done. Only one cluster set -- skipping cluster^2
Dropping all clusters of < 3 sqs ...
Renumbering clusters ...
Saving ...
------------------------------------------------
Completed stage CLUSTER^2: [2023-01-03 05:12:24]
------------------------------------------------
-------------------------------------------
Completed pipeline at [2023-01-03 05:12:24]
-------------------------------------------
r$> all_clusters <- read_phylota(wd)
r$> print(all_clusters)
[1] "Phylota Table (Aotus)\n- [184] clusters\n- [1945] sequences\n- [13] source taxa\n"
r$> all_clusters
Phylota Table (Aotus)
- [184] clusters
- [1945] sequences
- [13] source taxa
r$> cids <- all_clusters@cids
r$> cids
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8"
[10] "9" "10" "11" "12" "13" "14" "15" "16" "17"
[19] "18" "19" "20" "21" "22" "23" "24" "25" "26"
[28] "27" "28" "29" "30" "31" "32" "33" "34" "35"
[37] "36" "37" "38" "39" "40" "41" "42" "43" "44"
[46] "45" "46" "47" "48" "49" "50" "51" "52" "53"
[55] "54" "55" "56" "57" "58" "59" "60" "61" "62"
[64] "63" "64" "65" "66" "67" "68" "69" "70" "71"
[73] "72" "73" "74" "75" "76" "77" "78" "79" "80"
[82] "81" "82" "83" "84" "85" "86" "87" "88" "89"
[91] "90" "91" "92" "93" "94" "95" "96" "97" "98"
[100] "99" "100" "101" "102" "103" "104" "105" "106" "107"
[109] "108" "109" "110" "111" "112" "113" "114" "115" "116"
[118] "117" "118" "119" "120" "121" "122" "123" "124" "125"
[127] "126" "127" "128" "129" "130" "131" "132" "133" "134"
[136] "135" "136" "137" "138" "139" "140" "141" "142" "143"
[145] "144" "145" "146" "147" "148" "149" "150" "151" "152"
[154] "153" "154" "155" "156" "157" "158" "159" "160" "161"
[163] "162" "163" "164" "165" "166" "167" "168" "169" "170"
[172] "171" "172" "173" "174" "175" "176" "177" "178" "179"
[181] "180" "181" "182" "183"
r$> n_taxa <- get_ntaxa(phylota = all_clusters, cid = cids)
r$> n_taxa
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
5 10 5 2 2 9 9 9 9 9 8 9 9 9 4
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
1 9 8 8 8 8 2 1 1 1 1 4 2 2 1
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
1 1 2 2 2 4 2 1 1 1 1 2 2 2 2
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
2 2 2 2 2 2 3 2 3 1 2 4 3 1 1
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
1 1 1 4 2 1 1 6 2 5 1 1 4 4 5
75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
1 1 1 5 1 1 1 3 5 5 5 5 5 5 5
90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
5 5 5 5 5 3 4 4 4 4 4 4 4 4 4
120 121 122 123 124 125 126 127 128 129 130 131 132 133 134
4 4 4 4 4 4 4 4 4 4 4 3 3 4 4
135 136 137 138 139 140 141 142 143 144 145 146 147 148 149
1 4 4 1 1 1 2 2 3 3 3 3 3 3 3
150 151 152 153 154 155 156 157 158 159 160 161 162 163 164
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
3 3 3 3 3 3 1 3 1 1 3 1 1 1 1
180 181 182 183
1 1 2 2
r$> keep <- cids[taxa_all > 10]
Error: object 'taxa_all' not found
r$> n_taxa <- get_ntaxa(phylota = all_clusters, cid = cids)
r$> n_taxa
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
5 10 5 2 2 9 9 9 9 9 8 9 9 9 4
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
1 9 8 8 8 8 2 1 1 1 1 4 2 2 1
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
1 1 2 2 2 4 2 1 1 1 1 2 2 2 2
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
2 2 2 2 2 2 3 2 3 1 2 4 3 1 1
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
1 1 1 4 2 1 1 6 2 5 1 1 4 4 5
75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
1 1 1 5 1 1 1 3 5 5 5 5 5 5 5
90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
5 5 5 5 5 3 4 4 4 4 4 4 4 4 4
120 121 122 123 124 125 126 127 128 129 130 131 132 133 134
4 4 4 4 4 4 4 4 4 4 4 3 3 4 4
135 136 137 138 139 140 141 142 143 144 145 146 147 148 149
1 4 4 1 1 1 2 2 3 3 3 3 3 3 3
150 151 152 153 154 155 156 157 158 159 160 161 162 163 164
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
3 3 3 3 3 3 1 3 1 1 3 1 1 1 1
180 181 182 183
1 1 2 2
r$> cids
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8"
[10] "9" "10" "11" "12" "13" "14" "15" "16" "17"
[19] "18" "19" "20" "21" "22" "23" "24" "25" "26"
[28] "27" "28" "29" "30" "31" "32" "33" "34" "35"
[37] "36" "37" "38" "39" "40" "41" "42" "43" "44"
[46] "45" "46" "47" "48" "49" "50" "51" "52" "53"
[55] "54" "55" "56" "57" "58" "59" "60" "61" "62"
[64] "63" "64" "65" "66" "67" "68" "69" "70" "71"
[73] "72" "73" "74" "75" "76" "77" "78" "79" "80"
[82] "81" "82" "83" "84" "85" "86" "87" "88" "89"
[91] "90" "91" "92" "93" "94" "95" "96" "97" "98"
[100] "99" "100" "101" "102" "103" "104" "105" "106" "107"
[109] "108" "109" "110" "111" "112" "113" "114" "115" "116"
[118] "117" "118" "119" "120" "121" "122" "123" "124" "125"
[127] "126" "127" "128" "129" "130" "131" "132" "133" "134"
[136] "135" "136" "137" "138" "139" "140" "141" "142" "143"
[145] "144" "145" "146" "147" "148" "149" "150" "151" "152"
[154] "153" "154" "155" "156" "157" "158" "159" "160" "161"
[163] "162" "163" "164" "165" "166" "167" "168" "169" "170"
[172] "171" "172" "173" "174" "175" "176" "177" "178" "179"
[181] "180" "181" "182" "183"
r$> keep <- cids[n_taxa > 10]
r$> selected <- drop_clstrs(phylota = all_clusters, cid = keep
)
r$> reduced <- drop_by_rank(selected, rnk = 'species', n=1)
r$> reduced
Phylota Table (Aotus)
- [0] clusters
- [0] sequences
- [0] source taxa
r$> keep <- cids[n_taxa >= 4]
r$> keep
[1] "0" "1" "2" "5" "6" "7" "8" "9" "10" "11" "12"
[12] "13" "14" "16" "17" "18" "19" "20" "26" "35" "56" "63"
[23] "67" "69" "72" "73" "74" "78" "83" "84" "85" "86" "87"
[34] "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98"
[45] "99" "100" "101" "102" "103" "104" "105" "106" "107" "108" "109"
[56] "111" "112" "113" "114" "115" "116" "117" "118" "119" "120" "121"
[67] "122" "123" "124" "125" "126" "127" "128" "129" "130" "133" "134"
[78] "136" "137"
r$> keep <- cids[n_taxa >= 5]
r$> selected <- drop_clstrs(phylota = all_clusters, cid = keep)
r$> selected
Phylota Table (Aotus)
- [48] clusters
- [1073] sequences
- [13] source taxa
r$> reduced <- drop_by_rank(selected, rnk = 'species', n=1)
Hi @ShixiangWang,
the zip file is too large for uploading (77 MB), please download it from here.
I've also tried your example above, but without success, still getting the same error Error in phylota@sqs@sqs[[i]] : attempt to select less than one element in get1index
.
Cheers Bastian
@bheimbu You can send it to shixiang1994wang@gmail.com.
It's a little strange, did you set the id to 9504? My whole directory is only 16 MB. If you also set 9504
but the working directory is 77MB, please retry with another new working directory. If the error occurs, send the zipped directory via the email above.
@ShixiangWang ➜ /workspaces $ du -sh test2
16M test2
@ShixiangWang ➜ /workspaces $ ls test2
blast blast_versions.txt cache log.txt session_info.txt
Best,
Shixiang
Hi,
no sorry I'm using taxid 1912919 (Termitoidae (termites)).
Cheers Bastian
@bheimbu Thanks, I got it.
Hi @ShixiangWang,
do you have any news or solution?
Cheers Bastian
@bheimbu I can reproduce the error in my test environment. It would take some time to explore the error and solution. Thanks for your example data. I will reply when I have any updates.
@bheimbu Please install the latest version from GitHub and retry again, the specific error should be fixed
remotes::install_github('ropensci/phylotaR')
@ShixiangWang it works, many thx.
Great, happy to hear that.
Hi, I am trying to reduce the sequence data by using drop_by_rank() but getting the following error:
Error in phylota@sqs@sqs[[i]] : attempt to select less than one element in get1index
I am running the pipeline according to the tutorial:
library(phylotaR) wd <- '/media/... ncbi_dr <- '/home/... txid <- 4747 setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr, v=TRUE) run(wd = wd)
all_clusters <- read_phylota(wd) print(all_clusters) cids <- all_clusters@cids n_taxa <- get_ntaxa(phylota = all_clusters, cid = cids) keep <- cids[taxa_all > 10] selected <- drop_clstrs(phylota = all_clusters, cid = keep) reduced <- drop_by_rank(selected, rnk = 'species', n=1)
My experience with R is unfortunately quite limited and I appreciate the help. Thanks!