vlothec / TRASH

RepeatIdentifier
MIT License

Sequence template error #3

Open alexxjss opened 1 year ago

alexxjss commented 1 year ago

Hello,

I'm having this issue when trying to use a sequence template .csv file:

Warning message: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on

I'm using multiple template sequences in the same file, but only the first one is being identified.

Thanks

vlothec commented 1 year ago

Hi, is the sequence table formatted with Unix-style line endings? If not, you can convert it with the dos2unix command.
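As a rough illustration of what to check, here is a minimal Python sketch that reports whether a file's bytes contain Windows (CRLF) line endings and whether the last line ends with a newline, then normalizes both. R generally emits the "incomplete final line" warning when the final newline is missing. The function names are hypothetical helpers, not part of TRASH or dos2unix.

```python
def describe_line_endings(data: bytes) -> dict:
    """Report the two properties that commonly trip up read.table()."""
    return {
        "crlf": b"\r\n" in data,                  # Windows-style line endings present
        "final_newline": data.endswith(b"\n"),    # last line is properly terminated
    }


def normalize_line_endings(data: bytes) -> bytes:
    """Convert CRLF and bare CR to LF and make sure the file ends with LF."""
    text = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    if text and not text.endswith(b"\n"):
        text += b"\n"
    return text
```

To use it on a template file, read the file in binary mode (`open(path, "rb").read()`), pass the bytes through `normalize_line_endings`, and write the result back in binary mode.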

alexxjss commented 1 year ago

The error is gone, thank you. I have four previously identified satellite sequences (from centromeric ChIP-seq), and TRASH is finding a match for only one of them. Do you have any advice on fine-tuning that could help identify the others? Thanks

vlothec commented 1 year ago

Do I understand correctly that the ChIP-seq sequences are the consensus sequences of 4 different classes of repeats?

To avoid erroneous matches, TRASH first uses the length of the provided consensus: it checks whether the repeat being classified is close to the length of the template it is compared against. I would look for an issue there first. If you make a histogram of repeat lengths from the output, do the peaks correspond to the lengths of the provided templates?
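The histogram check suggested above can be sketched in a few lines of Python. This assumes the repeat lengths have already been extracted from the TRASH output into a list of integers (the column that holds them is not stated in this thread). The 10% tolerance is an arbitrary illustrative choice, not the threshold TRASH uses internally.

```python
from collections import Counter


def length_histogram(lengths, bin_width=10):
    """Count repeats per length bin; each key is the lower edge of a bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)


def templates_without_peak(lengths, template_lengths, tolerance=0.1, min_count=1):
    """Return the template lengths that have fewer than `min_count` repeats
    within +/- `tolerance` (as a fraction) of their own length."""
    missing = []
    for t in template_lengths:
        low, high = t * (1 - tolerance), t * (1 + tolerance)
        hits = sum(1 for n in lengths if low <= n <= high)
        if hits < min_count:
            missing.append(t)
    return missing


# Toy values for illustration only, not real TRASH output.
toy_lengths = [178, 178, 180, 179, 340]
toy_templates = [178, 1200]
unmatched = templates_without_peak(toy_lengths, toy_templates)
```

Any template length reported as unmatched has no repeats of a similar size in the output, which points at the size limits of the run rather than at the template sequence itself.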

Another approach to classify sequences (outside of TRASH) could be to make a phylogenetic tree and look for distinct branches that would indicate groups of sequences.

Just a note: I am working on a new method for assigning classes to individual repeats; it should be live in a few weeks.

alexxjss commented 1 year ago

Yes, there are 4 different consensus sequences.

Two of the ChIP-seq monomers are longer than 1000 bp, and the satellites from TRASH are all shorter than 300 bp. My main goal is to check the HOR arrangement for those 4 centromeric satellites.

For sure, I can wait for the new method then.

Thank you!!

vlothec commented 1 year ago

Right, so the first step would be to increase the maximum repeat size that TRASH will look for. To keep runs short, and because many tandem repeats fall roughly in the 100-500 bp range, the default for that maximum is 800 bp. To adjust it for repeats of slightly more than 1000 bp, --win 2000 --m 1400 would be good. The first option increases the size of the windows in which local repetition patterns are looked for, so a good approximation is to set it to double the expected repeat size. The second option is a hard cap on the length of the identified repeat itself.

Long repeats (a few kbp or more) tend to contain internal repeats that TRASH may identify preferentially. If that's the case, let me know and maybe it can be addressed with other settings.
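The rule of thumb above can be written down as a small Python helper. The doubling for --win comes from the reply itself. The 140% cap for --m is only an assumption extrapolated from the single "--m 1400 for slightly more than 1000 bp" example, not a documented TRASH rule. The function name is hypothetical.

```python
def suggest_size_flags(longest_template_bp, cap_percent=140):
    """Build --win/--m arguments from the longest expected repeat length.

    --win: double the expected repeat size (rule of thumb from the thread).
    --m:   hard cap, here `cap_percent` % of the repeat size, rounded up
           (the percentage is an assumption, adjust as needed).
    """
    win = 2 * longest_template_bp
    cap = (longest_template_bp * cap_percent + 99) // 100  # integer ceiling
    return ["--win", str(win), "--m", str(cap)]


flags = suggest_size_flags(1000)
```

The returned list can be appended to the rest of the command line, for example when launching the run script through `subprocess.run`.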

alexxjss commented 1 year ago

It works great! Now the software found all four centromeric sequences I used as templates.

I'm trying to run it in HOR mode by adding --horclass <centSat-class-name> --horonly, and I'm getting this error:

Error in file(file, ifelse(append, "a", "w")) : cannot open the connection
Calls: calc.edit.distance -> write -> cat -> file
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
  cannot open file 'Lcu.1GRN.fa_out/Lcu.1GRN.Chr1.out.txt': No such file or directory
Execution halted

vlothec commented 1 year ago

Just to mention, --horclass <centSat-class-name> --horonly still requires the --seqt and --o arguments used in the initial run, but I don't think that's the issue here. The "calc.edit.distance" function that causes the crash is executed at the very beginning in this mode. The error relates to the progress update saved in the "*.out.txt" file. Even if that file did not exist, the write() function would create it, so the issue should come from the directory that would contain the text file, in this case "Lcu.1GRN.fa_out". It should have been created during the initial run.
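The distinction made above (a missing file gets created, a missing directory does not) can be demonstrated with a small Python analogue. This is not TRASH's R code, and the directory and file names are hypothetical. Note also that a relative path like the one in the error message is resolved against the current working directory of the process.

```python
import os
import tempfile


def can_append(path):
    """Try to open `path` in append mode, which creates the file if needed."""
    try:
        with open(path, "a"):
            return True
    except FileNotFoundError:
        # Raised when a directory component of the path does not exist.
        return False


with tempfile.TemporaryDirectory() as work:
    os.mkdir(os.path.join(work, "run_out"))  # hypothetical output directory
    # The directory exists, the file does not: the file is created.
    in_existing_dir = can_append(os.path.join(work, "run_out", "progress.out.txt"))
    # The directory itself is missing: opening fails, as in the R error.
    in_missing_dir = can_append(os.path.join(work, "missing_out", "progress.out.txt"))
```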

Can you check if that folder indeed doesn't exist and upload the console output from the run?

alexxjss commented 1 year ago

Yes, I'm using the same command, just adding the extra arguments for the HOR analysis.

I'm also using the same --out directory as the original (non-HOR) run, so inside this directory "Lcu.1GRN.fa_out" exists with all the ".txt" files.

I've tried to run using a different --out, and it gives this error:

Error in file(con, "r") : cannot open the connection
Calls: read.fasta -> readLines -> file
In addition: Warning messages:
1: In file(con, "r") : 'raw = FALSE' but '' is not a regular file
2: In file(con, "r") : cannot open file 'out-path': it is a directory
Execution halted

blavetn commented 1 year ago

Hello, I am getting a similar error when trying to use --horclass <centSat-class-name> --horonly after a successful TRASH run:

Error in file(file, ifelse(append, "a", "w")) : cannot open the connection
Calls: calc.edit.distance -> write -> cat -> file
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
  cannot open file 'AURI.FINAL.fasta_out/AURI_chr_1.out.txt': No such file or directory
Execution halted

Of course the file exists...

blavetn commented 1 year ago

Hello, I have found how to run it. You need to run the command from the parent directory that contains the "_out" directory.

>cd AURI_TRASH
>ls
all.repeats.from.AURI.FINAL.fasta.csv
AURI.FINAL.fasta_out
HOR
plots
Summary.of.repetitive.regions.AURI.FINAL.fasta.csv
temp.all.repeats.from.AURI.FINAL.fasta.csv
TRASH_AURI.FINAL.fasta.gff

>bash ../TRASH_run.sh /mnt/nfs/shared/CFBioinformatics/workspace/nicolas/biscutella/2023.04_assembly_results/AURI.FINAL.fasta --seqt /mnt/nfs/shared/CFBioinformatics/workspace/nicolas/AURI_TRASH_SAT.csv --o /mnt/ssd/ssd_1/workspace/nicolas/TRASH/AURI_TRASH --par 20 --randomseed 1411833438 --m 1500 --win 2000 --horonly --horclass 'SAT000014a'

I also had to make src/HOR.V3.3 executable with chmod.
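The two fixes above (running from the directory that holds the "_out" folder, and making the HOR binary executable) could be checked before launching a rerun with a pre-flight sketch like the following. The helper names are hypothetical. The "<FASTA file name>_out" naming is inferred from the paths shown in this thread, not from TRASH documentation.

```python
import os
import stat


def preflight(fasta_path, hor_binary, cwd=None):
    """Return a list of problems likely to block an --horonly rerun."""
    cwd = cwd or os.getcwd()
    problems = []
    # Assumed naming: "<FASTA file name>_out" inside the working directory.
    out_dir = os.path.join(cwd, os.path.basename(fasta_path) + "_out")
    if not os.path.isdir(out_dir):
        problems.append(f"missing directory: {out_dir}")
    if not os.path.isfile(hor_binary):
        problems.append(f"missing file: {hor_binary}")
    elif not os.access(hor_binary, os.X_OK):
        problems.append(f"not executable: {hor_binary}")
    return problems


def make_executable(path):
    """Add the owner's execute bit, similar to `chmod u+x path`."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)
```

Calling `preflight(...)` from the directory where the rerun will be started returns an empty list when both conditions are met.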

Nevertheless, I am not sure I am getting the expected output...

alexxjss commented 1 year ago

Hello,

@blavetn I've used the same strategy, and it worked! Thanks a lot.