roblanf / sangeranalyseR

functions to analyse sanger sequencing reads in R
MIT License

Error in vapply(object@contigList, function(contig) { : values must be length 1, but FUN(X[[1]]) result is length 2. There were 33 warnings (use warnings() to see them) #53

Closed gabriellovate closed 3 years ago

gabriellovate commented 4 years ago

Hi,

When running:

my_aligned_contigs <- SangerAlignment(parentDirectory     = "F:/Google Drive/research/active_projects/functional_genomics_of_resistome/data/2020.08.07/ab1_files",
                                      namesConversionCSV  = "F:/Google Drive/research/active_projects/functional_genomics_of_resistome/results/2020.08.07/sangeranalyser_indexing.csv")

writeFasta(my_aligned_contigs, outputDir = "F:/Google Drive/research/active_projects/functional_genomics_of_resistome/results/2020.08.07/")

I'm getting the following error/warnings:


Error in vapply(object@contigList, function(contig) { : 
  values must be length 1,
 but FUN(X[[1]]) result is length 2
There were 33 warnings (use warnings() to see them)
> 
> writeFasta(my_aligned_contigs, outputDir = "F:/Google Drive/research/active_projects/functional_genomics_of_resistome/results/2020.08.07/")
INFO [2020-08-07 19:43:47] Your input is 'SangerAlignment' S4 instance
INFO [2020-08-07 19:43:47] >>> outputDir : F:/Google Drive/research/active_projects/functional_genomics_of_resistome/results/2020.08.07/
INFO [2020-08-07 19:43:47] Start to write 'SangerAlignment' to FASTA format ...
INFO [2020-08-07 19:43:47] >> Writing 'alignment' to FASTA ...
INFO [2020-08-07 19:43:47] >> Writing 'contigs' to FASTA ...
INFO [2020-08-07 19:43:47] >> Writing all single reads to FASTA ...
Error in vapply(object@contigList, function(contig) { : 
  values must be length 1,
 but FUN(X[[1]]) result is length 2
> warnings()
Warning messages:
1: In read.abif(readFileName) : unimplemented legacy type found in file
2: In read.abif(readFileName) : unimplemented legacy type found in file
3: In read.abif(readFileName) : unimplemented legacy type found in file
4: In read.abif(readFileName) : unimplemented legacy type found in file
5: In read.abif(readFileName) : unimplemented legacy type found in file
6: In read.abif(readFileName) : unimplemented legacy type found in file
7: In read.abif(readFileName) : unimplemented legacy type found in file
8: In read.abif(readFileName) : unimplemented legacy type found in file
9: In read.abif(readFileName) : unimplemented legacy type found in file
10: In read.abif(readFileName) : unimplemented legacy type found in file
11: In read.abif(readFileName) : unimplemented legacy type found in file
12: In read.abif(readFileName) : unimplemented legacy type found in file
13: In IdClusters(dist, type = "both", showPlot = FALSE, processors = processorsNum,  ... :
  Substituting 1.28 for non-finite values in myDistMatrix.
14: In read.abif(readFileName) : unimplemented legacy type found in file
15: In read.abif(readFileName) : unimplemented legacy type found in file
16: In read.abif(readFileName) : unimplemented legacy type found in file
17: In read.abif(readFileName) : unimplemented legacy type found in file
18: In read.abif(readFileName) : unimplemented legacy type found in file
19: In read.abif(readFileName) : unimplemented legacy type found in file
20: In read.abif(readFileName) : unimplemented legacy type found in file
21: In read.abif(readFileName) : unimplemented legacy type found in file
22: In read.abif(readFileName) : unimplemented legacy type found in file
23: In read.abif(readFileName) : unimplemented legacy type found in file
24: In read.abif(readFileName) : unimplemented legacy type found in file
25: In read.abif(readFileName) : unimplemented legacy type found in file
26: In read.abif(readFileName) : unimplemented legacy type found in file
27: In read.abif(readFileName) : unimplemented legacy type found in file
28: In read.abif(readFileName) : unimplemented legacy type found in file
29: In read.abif(readFileName) : unimplemented legacy type found in file
30: In read.abif(readFileName) : unimplemented legacy type found in file
31: In read.abif(readFileName) : unimplemented legacy type found in file
32: In read.abif(readFileName) : unimplemented legacy type found in file
33: In read.abif(readFileName) : unimplemented legacy type found in file
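
A note on the error text itself: the message is raised by base R's `vapply()`, which requires every result to match the declared template. The sketch below reproduces the same message with a hypothetical stand-in for `object@contigList` (the list contents and the function body are assumptions, not the package's code), and shows one way to find the element that breaks the length-1 rule.

```r
# Hypothetical stand-in for object@contigList: one entry per contig.
contigs <- list(
  contigA = "contigA_1_F.ab1",
  contigB = c("contigB_1_F.ab1", "contigB_1_F.ab1")  # two values instead of one
)

# vapply() checks each result against the template character(1).
msg <- tryCatch(
  vapply(contigs, function(reads) reads, character(1)),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")

# lengths() shows which elements would violate the length-1 requirement.
offending <- which(lengths(contigs) != 1)
print(offending)
```

The index inside `FUN(X[[...]])` in the message is the position of the first offending element, so `X[[1]]` and `X[[10]]` in the reports here point at different contigs.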

Kuanhao-Chao commented 4 years ago

Hi @gabriellovate ,

Thanks for raising this issue. We'll work on it! I guess that the read names and the names in your CSV file are mismatched. If there's any chance you can send me your files off-list (kuanhao.chao@gmail.com) that would really help. We'll only use them for fixing this bug!

Howard

thokall commented 4 years ago

I got the same error just by following the tutorial here https://sangeranalyser.readthedocs.io/en/latest/content/beginner.html#step-2-loading-and-analysing-your-data, but with my own sequence files.

thokall commented 4 years ago

Could the error be related to samples not generating a contig (no data that passes QC), so that empty contigs are created?

roblanf commented 4 years ago

I think it could well be. Howard - what we need are some QC steps that produce warnings when e.g. (i) any reads are left out of any contigs; (ii) any contigs end up with no reads at all; and anything else we can think of that users might want to be warned about when assembling all the data...
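
A minimal sketch of the two QC warnings described above, in plain R. The inputs (`all_reads`, `contig_reads`) and the helper `qc_contigs()` are hypothetical; the package stores this information in its own S4 objects, so this only illustrates the checks, not how they would be wired in.

```r
# Hypothetical inputs: every read file found, and the reads assigned to each contig.
all_reads <- c("s1_F.ab1", "s1_R.ab1", "s2_F.ab1", "s3_F.ab1")
contig_reads <- list(
  s1 = c("s1_F.ab1", "s1_R.ab1"),
  s2 = "s2_F.ab1",
  s3 = character(0)  # nothing was assigned to this contig
)

qc_contigs <- function(all_reads, contig_reads) {
  # (i) reads that ended up in no contig
  unassigned <- setdiff(all_reads, unlist(contig_reads, use.names = FALSE))
  # (ii) contigs that ended up with no reads
  empty <- names(contig_reads)[lengths(contig_reads) == 0]
  if (length(unassigned) > 0) {
    warning("Reads not used in any contig: ",
            paste(unassigned, collapse = ", "), call. = FALSE)
  }
  if (length(empty) > 0) {
    warning("Contigs with no reads: ",
            paste(empty, collapse = ", "), call. = FALSE)
  }
  invisible(list(unassigned = unassigned, empty = empty))
}

qc <- qc_contigs(all_reads, contig_reads)
```

Returning the two vectors as well as warning makes it easy to act on the result programmatically, for example by dropping empty contigs before export.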

Kuanhao-Chao commented 4 years ago

hi @thokall,

I am working on the QC steps now. If you can send me your dataset (kuanhao.chao@gmail.com), I can first check where the problem is for you. Thank you!

Howard

thokall commented 4 years ago

Hi,

My reasoning came from the fact that for some samples I only had forward data, and a subset of these were of very low quality. If I just loop over all individual .ab1 files and extract FASTA, I do get a result from all of them, but in some cases there is just a single bp left. I presume that this base is left in to avoid having empty outputs. For contigs I would expect a consensus read to contain the high-quality sequence found in any read, even if there is no matching read in the other direction.

Eg:

  1. Forward OK, reverse OK -> Consensus from the two reads (a true contig)
  2. Forward OK reverse BAD or vice versa -> Consensus is simply the high quality part of the okay read
  3. Forward bad, reverse bad -> Empty (or single bp) sequence
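
The three expectations above can be written as a small decision table. The helper `expected_output()` is hypothetical and only encodes what is listed here; it is not how sangeranalyseR decides what to write.

```r
# Hypothetical helper: maps read quality flags to the expected kind of output.
expected_output <- function(forward_ok, reverse_ok) {
  if (forward_ok && reverse_ok) {
    "consensus of both reads"
  } else if (forward_ok || reverse_ok) {
    "high-quality part of the good read"
  } else {
    "empty (or single bp) sequence"
  }
}

# Enumerate all four combinations of forward/reverse quality.
cases <- expand.grid(forward_ok = c(TRUE, FALSE), reverse_ok = c(TRUE, FALSE))
cases$expected <- mapply(expected_output, cases$forward_ok, cases$reverse_ok)
print(cases)
```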

Will share data via mail asap

roblanf commented 4 years ago

Hi Thomas,

I think your reasoning is about right, and what the package does will depend on certain settings you have (e.g. how many reads need to overlap a position for that base to be included in the contig).

Once you send us the data we'll take a look and get back to you. I really appreciate your willingness to take the time to engage and help - it's the only way the package will improve!

Rob

thokall commented 3 years ago

I have tried to create a minimal example to help identify the issue, but it looks like the problem only occurs when I have more than 200 ab1 files as input. If I split the files over two folders and run the analysis on each separately, there is no longer any issue. But adding more than 200 files to a single folder and running the analysis generates the following:

al <- SangerAlignment(parentDirectory     = "~/seqsfish/ab/test",
                      suffixForwardRegExp = "_[0-9]+_F+",
                      suffixReverseRegExp = "_[0-9]+_R+")
WARN [2020-09-12 18:15:38] The number of your total reads is 0.
Number of total reads has to be equal or more than 2 ('minReadsNum' that you set)
INFO [2020-09-12 18:15:38] Aligning consensus reads ... 
INFO [2020-09-12 18:15:38] Before building!!
INFO [2020-09-12 18:15:40] After building!!
SUCCESS [2020-09-12 18:15:40]   >> 'SangerAlignment' S4 instance is created !!

and then

writeFasta(al, outputDir = "~/seqsfish/ab/test")

INFO [2020-09-12 18:15:48] Your input is 'SangerAlignment' S4 instance
INFO [2020-09-12 18:15:48] >>> outputDir : /home/thomkall/seqsfish/ab/test
INFO [2020-09-12 18:15:48] Start to write 'SangerAlignment' to FASTA format ...
INFO [2020-09-12 18:15:48] >> Writing 'alignment' to FASTA ...
INFO [2020-09-12 18:15:48] >> Writing 'contigs' to FASTA ...
INFO [2020-09-12 18:15:48] >> Writing all single reads to FASTA ...
Error in vapply(object@contigList, function(contig) { : 
  values must be length 1,
 but FUN(X[[10]]) result is length 2
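
One way to narrow this down without sharing data is to check, outside the package, how the two suffix regular expressions split the file names. The sketch below uses the regular expressions from the call above; the file names are hypothetical, and deriving the contig name by dropping everything from the suffix onwards is an assumption about the naming rule, not the package's documented behaviour.

```r
forward_re <- "_[0-9]+_F+"
reverse_re <- "_[0-9]+_R+"

# Hypothetical file names; in practice something like
# list.files("~/seqsfish/ab/test", pattern = "\\.ab1$") would supply them.
files <- c("fishA_1_F.ab1", "fishA_2_R.ab1", "fishB_1_F.ab1", "fishC.ab1")

is_fwd <- grepl(forward_re, files)
is_rev <- grepl(reverse_re, files)

# Assumed rule: the contig name is whatever precedes the matched suffix.
contig <- sub(paste0("(", forward_re, "|", reverse_re, ").*$"), "", files)

# Files matching neither expression would be left out entirely.
unmatched <- files[!is_fwd & !is_rev]

# Read count per contig, and contigs below a minimum of 2 reads.
reads_per_contig <- table(contig[is_fwd | is_rev])
too_few <- names(reads_per_contig)[as.vector(reads_per_contig) < 2]

print(unmatched)
print(reads_per_contig)
print(too_few)
```

Contigs listed in `too_few` are the ones that would be expected to trigger the "Number of total reads has to be equal or more than 2" warning shown above.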

roblanf commented 3 years ago

Hi @thokall, if you can send me or Howard your dataset (my contact: rob.lanfear@anu.edu.au) I'd be happy to take a look. We (of course) won't share your dataset with anyone. But it's really the only way we can debug things in a way that makes sure it will work for you.

thokall commented 3 years ago

I will check whether that is an option. I trust you not to spread the data, but since the data is not mine I need to get a green light from the owner, hence my attempt to create a minimal example and explore on my own.

tomsauv commented 3 years ago

Hello,

I get the same error from vapply; however, the contigs and alignment are written to file anyway. (I am on macOS Catalina, developer version of sangeranalyseR.)

I have noticed that in the report printed on screen, some chromatograms are doubled in number... e.g. below, AB4-3 shows with 2 forward reads and 2 reverse reads but in my folder each has only 1 chromatogram...

I don't mind sending my chromatograms to your email address if it helps

* >> Contig 'AB4-3':
SUCCESS [2021-21-01 20:24:26]           * >> 2 forward reads.
SUCCESS [2021-21-01 20:24:26]           * >> 2 reverse reads.
SUCCESS [2021-21-01 20:24:26]       * >> Contig 'AB4-32':
SUCCESS [2021-21-01 20:24:26]           * >> 1 forward reads.
SUCCESS [2021-21-01 20:24:26]           * >> 1 reverse reads.

Kuanhao-Chao commented 3 years ago

Hi @tomsauv,

Thanks for raising this issue. We'll work on it! I guess maybe the regular expression that you use matches both reads. I'll take a look after you send me the files. You can send me your files (kuanhao.chao@gmail.com) and that would be really helpful. We'll only use them for fixing this bug!

Howard

roblanf commented 3 years ago

@Kuanhao-Chao, I guess this is something we hadn't considered (i.e. double counting).

Whether or not double counting is the issue in this case, we should add a test where double counting occurs. We should then add a check when parsing regular expressions that each read is assigned to one and only one group (e.g. forward, reverse, or the contig groups). If one or more reads could be assigned to more than one group, we should throw an error whose message tabulates, for each such read, all the groups it has been assigned to, and suggests that users can use a CSV file (with a link to the documentation for how to do it) if the regular expressions are not working.
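
A minimal sketch of such a check, under stated assumptions: the read and contig names are hypothetical, chosen so that one contig name is a prefix of another, and the prefix-based assignment is only a stand-in for whatever matching the package actually performs. The helper `check_unique_assignment()` is likewise hypothetical.

```r
# Hypothetical read and contig names, chosen so that one contig name is a
# prefix of another (an illustration, not a confirmed diagnosis).
reads <- c("AB4-3_F.ab1", "AB4-3_R.ab1", "AB4-32_F.ab1", "AB4-32_R.ab1")
contig_names <- c("AB4-3", "AB4-32")

# Naive assignment: a read joins every contig whose name it starts with.
assigned <- lapply(reads, function(r) {
  hits <- vapply(contig_names, function(p) startsWith(r, p), logical(1))
  contig_names[hits]
})
names(assigned) <- reads

# Hypothetical check: stop with a table of reads matching zero or several groups.
check_unique_assignment <- function(assigned) {
  bad <- assigned[lengths(assigned) != 1]
  if (length(bad) > 0) {
    groups <- vapply(bad, function(g) {
      if (length(g) > 0) paste(g, collapse = ", ") else "<none>"
    }, character(1))
    stop(
      "Each read must match exactly one group:\n",
      paste0("  ", names(bad), " -> ", groups, collapse = "\n"),
      "\nIf the regular expressions cannot be fixed, list the reads in a CSV file.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

msg <- tryCatch(check_unique_assignment(assigned),
                error = function(e) conditionMessage(e))
cat(msg, "\n")
```

With these names, the two `AB4-32` reads match both contigs, so a naive matcher would count contig `AB4-3` as having 2 forward and 2 reverse reads.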

Kuanhao-Chao commented 3 years ago

Hi @tomsauv,

Thank you for your bug report. I've fixed the problem of repeated reads. Please install sangeranalyseR again from the latest master branch.

library(devtools)
install_github("roblanf/sangeranalyseR")

Let me know if there are any new problems. Thanks!

Howard

tomsauv commented 3 years ago

Runs without error! Thank you

thokall commented 3 years ago

I know this is closed, but I still want to mention that it now works as expected. Thanks to all who contributed, and sorry that I could not share my input to help find the problems earlier.

tomsauv commented 3 years ago

Hi,

I tried the .csv method and I get the vapply error. As in my earlier message of Jan 21, "2 forward reads and 2 reverse reads" shows up. The assembly completes, but then I get an error when exporting with writeFasta.

However, when I run the same dataset with the regex method, I get no issue. None of the forward/reverse reads shows as 2, only 1s, and exporting with writeFasta works.

Would the fix you did earlier only apply to the regex method and not the csv method?

This time I am on PC and latest developer version installed (trying to use sangeranalyseR for teaching...).

Thanks, Thomas
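
For anyone hitting the same behaviour with the CSV method, a quick check of the CSV itself can rule out duplicated entries before running the assembly. This is only a sketch: the CSV content is hypothetical, and the column names (`reads`, `direction`, `contig`) are assumed here and should be checked against the package documentation.

```r
# Hypothetical CSV content; column names are an assumption.
csv_text <- "reads,direction,contig
AB4-3_F.ab1,F,AB4-3
AB4-3_R.ab1,R,AB4-3
AB4-3_F.ab1,F,AB4-3
AB4-32_F.ab1,F,AB4-32"

# colClasses = "character" stops read.csv from turning a column of
# "F"/"T" values into logicals.
names_df <- read.csv(text = csv_text, colClasses = "character")

# Reads listed more than once in the CSV.
dup_reads <- unique(names_df$reads[duplicated(names_df$reads)])

# Number of reads per contig and direction.
reads_per_group <- table(names_df$contig, names_df$direction)

print(dup_reads)
print(reads_per_group)
```

If `dup_reads` is empty and every cell of `reads_per_group` matches the number of chromatograms in the folder, the doubling would have to come from somewhere other than the CSV.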