heche-psb / wgd

wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication
https://wgdv2.readthedocs.io/en/latest/
GNU General Public License v3.0
25 stars 0 forks source link

ksd not finishing #39

Closed EmilyPhelps closed 1 month ago

EmilyPhelps commented 4 months ago

Hello,

I've been trying to use wgd V2 to calculate a Ks peak

I've taken a subset of my CDS (~3K sequences)

When I run

 wgd dmd subset.fasta 

I get: Screenshot 2024-06-25 at 11 23 09 But it finishes. So if I run:

wgd ksd wgd_dmd/subset.fasta.tsv subset.fasta

It will start off okay and just slow to a stop. I also get this error, for pretty much each gene family: Screenshot 2024-06-25 at 11 29 45

heche-psb commented 4 months ago

Hi, thanks for your interest in using wgd v2. You usage with wgd dmd and wgd ksd is correct. The log information also looks normal. The warning for the family with stripped alignment length as 0 is not an error but a common feature for large families (for instance, >200 gene members) that after removing all the gap-containing columns (any column with '-') there is no column left and it's thus impossible to calculate Ks. If you want to calculate the Ks values for these large families anyhow, you may add the flag option --pairwise in wgd ksd command to calculate Ks values based on local pairwise alignment of each pair rather than of the whole family, which is expected to be less gappy. If your run with wgd ksd stopped unexpectedly, the first thing I suggest you to check is whether you gave sufficient memory to each thread you assigned.

EmilyPhelps commented 4 months ago

Thanks for the response- How would I go about checking if the memory is limiting the process? Even when I use very few threads it stops. I have 126G of memory and 64 cores, but the most I've used for this is 40, the fewest is 4 and it stops each time.

EmilyPhelps commented 4 months ago

Hi - Just to provide more information here. I've been trying to run a subset of ~3K sequences on 4 nodes now for about 16 hours and it seem to get stuck.

Screenshot 2024-07-04 at 09 26 37

It looks like its still running but when I check the htop there is nothing running (apart from angsd which I just started for another project). This time i tried using the pairwise flag.

Screenshot 2024-07-04 at 09 26 21
heche-psb commented 4 months ago

Hi, it seems that the wgd program was not occupying the memory properly in your system. Did you add such command in your submitting script to set the memory for each thread? Like below I distributed 1 core/thread and 20G for the job.

#!/bin/bash
#$ -pe serial 1
#$ -l h_vmem=20G
EmilyPhelps commented 4 months ago

Hi- I included that in the script and tried to run it again, but it still seems stuck in the same place and is not actually running when you look at htop

Screenshot 2024-07-05 at 09 51 25
heche-psb commented 4 months ago

It could be indeed a problem. Could you please try to use only 1 thread and distribute for instance 40Gb memory for that thread to see if it got stuck again?

EmilyPhelps commented 3 months ago

Hi I am running only the one core now, with memory set to 40G by including #$ -l h_vmem=40G in the script. It does not look like it is using the 40G.

Screenshot 2024-07-06 at 16 48 36

The memory allocation seems incorrect - saying there are only 6G? Perhaps this is the issue?

Screenshot 2024-07-06 at 16 51 42
PraveenOraon commented 3 months ago

Hi, Thanks for developing wgd2 an important tool for tracing down the ancient polyploidization events. I also ran into same problem like @EmilyPhelps wgd ksd running endless without any new files being generated in the temp folder or error been thrown onto the terminal.

heche-psb commented 3 months ago

Hi, a quick test, could you please download the small size test file https://github.com/heche-psb/wgd/blob/phylodating/test/data/egu1000.fasta and do wgd dmd egu1000.fasta; wgd ksd wgd_dmd/egu1000.fasta.tsv egu1000.fasta and see if you can make it run successfully?

EmilyPhelps commented 3 months ago

Hi- Just ran the test and it completed without issue!

PraveenOraon commented 3 months ago

Hi, Same the test ran successfully.

The one which u suggest wgd dmd egu1000.fasta; wgd ksd wgd_dmd/egu1000.fasta.tsv egu1000.fasta is a whole paranome approach for ksd. What I am following is through orthologous approach first then ksd wgd dmd cds1.fasta cds2.fasta cds3.fasta --globalmrbh followed by wgd ksd global_MRBH.tsv *cds.fasta --pairwise -o wgd_globalmrbh_ks -n 36 .

Regards Praveen

heche-psb commented 3 months ago

Hi, I think the problem is with the datafile you used, perhaps too large. Could you also randomly select only one thousand sequences of your own datafile to repeat what you have done with egu1000.fasta and see whether it got stuck?

PraveenOraon commented 3 months ago

Hi, I successfully ran the tool as per your suggestion on individual species file on full data (No random sequence selection was done). Also I was able to visualize the wgd events in my species attaching the results. Following are the commands I used.

  1. wgd' dmd SpeciesA.cds -o wgd_SpeciesA
  2. wgd ksd -n 30 wgd_SpeciesA/SpeciesA.cds.tsv SpeciesA.cds
  3. wgd viz -d wgd_ksd/SpeciesA.cds.tsv.ks.tsv -o wgd_ELMM

Four species were used for the analysis Picture1

KDE_gh Plot

Please Let me know how to troubleshoot the problem we are facing in analyzing in orthologous mode. As I wanted to perform that and get the final output with WGD dating and finally placed in a tree.

-Regards Praveen

heche-psb commented 3 months ago

Hi, it's nice to see the two prominent WGD peaks given by the ELMM result. To achieve rate correction, we need to first infer the globalMRBH (or less preferred localMRBH) families, based on which we infer orthologous Ks values and subsequently perform the rate correction. The command to realize such job can be set as below. May I know a bit more details of which step didn't work in your dataset and could you share the error log please, which will help us debug further.

wgd dmd --globalmrbh *allspecies -o wgd_globalmrbh
wgd ksd wgd_globalmrbh/global_MRBH.tsv *allspecies -o wgd_globalmrbh_ks
wgd viz -d wgd_globalmrbh_ks/global_MRBH.tsv.ks.tsv -fa focal_species -epk wgd_ksd/focal_species.tsv.ks.tsv -ap wgd_syn/iadhore-out/anchorpoints.txt -sp speciestree.nw -o wgd_viz_mixed_Ks --plotelmm --plotapgmm --reweight
RohanNathHERE commented 3 months ago

2024-07-24 16:36:17 INFO Analysing family GF00000059 core.py:3064 2024-07-24 16:36:17 INFO Analysing family GF00000060 core.py:3064

I'm encountering the same issue as Emily with the 'ksd' process. Despite allocating 200G of memory and 60 threads, it is still taking an excessive amount of time. I'm processing a whole genome, and after 'dmd', the total number of gene families is 13,166.

heche-psb commented 3 months ago

Hi, I guess it's a memory-insufficient issue caused by some ultra-large families (size>200). A quick validation, could you remove the first 200 large paralogous families and conduct the ksd command again on the reduced gene family file to see if it got stuck? Besides, are you using the --pairwise mode which entails much more calculations per family and what is the distribution of your gene family sizes?

2024-07-24 16:36:17 INFO Analysing family GF00000059 core.py:3064 2024-07-24 16:36:17 INFO Analysing family GF00000060 core.py:3064

I'm encountering the same issue as Emily with the 'ksd' process. Despite allocating 200G of memory and 60 threads, it is still taking an excessive amount of time. I'm processing a whole genome, and after 'dmd', the total number of gene families is 13,166.

RohanNathHERE commented 3 months ago

Hello,

I removed the first 200 paralogous families and ran the ksd analysis again. However, the process now gets stuck at family 618. I observed that the process starts off quickly but then gradually slows down. Could this be a memory issue?

Before removing the 200 families, the maximum size of a gene family was 126 genes, and there were a total of 70,849 rows. After the removal, the maximum size of a gene family is now 7 genes, and there are 70,649 rows remaining. This dataset consists only of CDS sequences from a molluscan species, following the ab initio annotation of its genome.

Let me know if you have any suggestions or if there's anything else I should check.

heche-psb commented 3 months ago

Hi, a quick question, is family 618 a singleton family? I notice that sometimes the ksd run stops when it goes through singleton family.

RohanNathHERE commented 3 months ago

No, family 618 is not a singleton; it contains three genes. Singleton families are skipped at the start of the ksd analysis. Out of the 70,649 gene families, I'm left with 2,548 gene families for the analysis. Here’s a snippet from my log for your reference:

INFO Skipping singleton family GF00070846['cds.faa_46294'] INFO Skipping singleton family GF00070847['cds.faa_00464'] INFO Skipping singleton family GF00070848['cds.faa_27309'] INFO 60 threads are used for 2,548 gene families Note that adding threads can significantly accelerate the Ks estimation process 2024-07-25 15:32:08 INFO Analysing family GF00000201 2024-07-25 15:32:08 INFO Analysing family GF00000202 2024-07-25 15:32:08 INFO Analysing family GF00000203

heche-psb commented 3 months ago

Hi, the occasion that I mention ksd stops at singleton family is due to memory overflow when concatenating all the individual .csv files in the end, which seems irrelevant in your case. Another check about the family 618, could you cd into the tmp folder for this family and see whether the .csv file has been produced already?

RohanNathHERE commented 3 months ago

I was reviewing the temporary files and found that GF00000617 has been executed successfully. However, GF00000618 appears to be stuck at a specific step. Below is the current content of the directory for your reference:

GF00000617 total 36K -rw-rw-r-- 1 workspace2 workspace2 1.6K Jul 25 16:06 pro.fasta -rw-rw-r-- 1 workspace2 workspace2 2.3K Jul 25 16:06 pro.aln -rw-rw-r-- 1 workspace2 workspace2 399 Jul 25 16:06 GF00000617.ctrl -rw-rw-r-- 1 workspace2 workspace2 6.5K Jul 25 16:06 GF00000617.cdsaln -rw-rw-r-- 1 workspace2 workspace2 5.6K Jul 25 16:06 GF00000617.codeml -rw-rw-r-- 1 workspace2 workspace2 81 Jul 25 16:06 pro.aln.nw -rw-rw-r-- 1 workspace2 workspace2 698 Jul 25 16:06 GF00000617_ks.csv

GF00000618 total 20K -rw-rw-r-- 1 workspace2 workspace2 1.7K Jul 25 16:06 pro.fasta -rw-rw-r-- 1 workspace2 workspace2 2.3K Jul 25 16:06 pro.aln -rw-rw-r-- 1 workspace2 workspace2 399 Jul 25 16:06 GF00000618.ctrl -rw-rw-r-- 1 workspace2 workspace2 6.7K Jul 25 16:06 GF00000618.cdsaln -rw-rw-r-- 1 workspace2 workspace2 0 Jul 25 16:06 rub -rw-rw-r-- 1 workspace2 workspace2 0 Jul 25 16:06 rst1 -rw-rw-r-- 1 workspace2 workspace2 0 Jul 25 16:06 rst -rw-rw-r-- 1 workspace2 workspace2 0 Jul 25 16:06 GF00000618.codeml -rw-rw-r-- 1 workspace2 workspace2 0 Jul 25 16:06 2NG.t -rw-rw-r-- 1 workspace2 workspace2 0 Jul 25 16:06 2NG.dS -rw-rw-r-- 1 workspace2 workspace2 0 Jul 25 16:06 2NG.dN -rw-rw-r-- 1 workspace2 workspace2 0 Jul 25 16:06 2ML.t -rw-rw-r-- 1 workspace2 workspace2 0 Jul 25 16:06 2ML.dS -rw-rw-r-- 1 workspace2 workspace2 0 Jul 25 16:06 2ML.dN

heche-psb commented 3 months ago

Hi, could you try to conduct codeml GF00000618.ctrl and see what you get?

RohanNathHERE commented 3 months ago

Yeah, I tried codeml GF00000618.ctrl and got the following results:

total 76K lrwxrwxrwx 1 workspace2 workspace2 29 May 29 2020 codeml.ctl -> /usr/lib/paml/data/codeml.ctl -rw-rw-r-- 1 workspace2 workspace2 1.7K Jul 25 16:06 pro.fasta -rw-rw-r-- 1 workspace2 workspace2 2.3K Jul 25 16:06 pro.aln -rw-rw-r-- 1 workspace2 workspace2 399 Jul 25 16:06 GF00000618.ctrl -rw-rw-r-- 1 workspace2 workspace2 6.7K Jul 25 16:06 GF00000618.cdsaln -rw-rw-r-- 1 workspace2 workspace2 5.9K Jul 25 16:52 mlc -rw-rw-r-- 1 workspace2 workspace2 6.5K Jul 25 16:52 lnf -rw-rw-r-- 1 workspace2 workspace2 0 Jul 25 16:53 rub -rw-rw-r-- 1 workspace2 workspace2 161 Jul 25 16:53 rst1 -rw-rw-r-- 1 workspace2 workspace2 444 Jul 25 16:53 rst -rw-rw-r-- 1 workspace2 workspace2 5.6K Jul 25 16:53 GF00000618.codeml -rw-rw-r-- 1 workspace2 workspace2 97 Jul 25 16:53 2NG.t -rw-rw-r-- 1 workspace2 workspace2 97 Jul 25 16:53 2NG.dS -rw-rw-r-- 1 workspace2 workspace2 97 Jul 25 16:53 2NG.dN -rw-rw-r-- 1 workspace2 workspace2 82 Jul 25 16:53 2ML.t -rw-rw-r-- 1 workspace2 workspace2 82 Jul 25 16:53 2ML.dS -rw-rw-r-- 1 workspace2 workspace2 82 Jul 25 16:53 2ML.dN

heche-psb commented 3 months ago

If codeml GF00000618.ctrl works fine, which indicates no issue with this specific gene family, I would attribute the unexpected stopping of ksd still to the memory issue for some overflowed threads. I would suggest reducing the number of threads and increase the memory per thread, for instance 20 threads and per thread 10Gb.

RohanNathHERE commented 3 months ago

Thank you, I'll give it a try and keep you updated.

RohanNathHERE commented 3 months ago

Yes, I believe it is due to a memory issue. I've reduced the number of threads and increased the memory per thread. However, it now gets stuck at gene family 1056. Are there any other parameters that might help reduce the memory usage per thread?

heche-psb commented 3 months ago

Hi, the default parameter of ksd should cost the minimum memory already. Normally I would assign 4-8 threads and 15-20Gb per thread. You may also try the same. Let me know if it still got stuck. Besides, are you using paml v4.9j?

RohanNathHERE commented 3 months ago

Yeah, I'm using paml v4.9j. I'll keep you updated.

RohanNathHERE commented 3 months ago

I’m encountering an issue with running the wgd syn command. I processed a genome file using Augustus for annotation and extracted CDS sequences using grep -v "#" | grep "CDS" followed by extracting the sequences, resulting in cds.faa. Subsequently, I ran the following commands:

  1. wgd dmd cds.faa -e 1e-50 -I 2.0 -c 0.9 --orthoinfer -o wgd_dmd -n 60
  2. wgd ksd wgd_dmd/cds.faa.tsv cds.faa --cds --strip_gaps -o wgd_ksd -n 8

and got the cds.faa.tsv.ks.tsv file. However, now I'm running

wgd syn -f CDS -a ID wgd_dmd/cds.faa.tsv cds.gff -ks wgd_ksd/cds.faa.tsv.ks.tsv -o wgd_syn -n 40 but is facing an issue:

                ERROR    No genes from families file `wgd_dmd/cds.faa.tsv` found in the GFF file for `feature=CDS` and `attribute=ID`, please double check command settings.                                                                                      cli.py:696

This is my cds.gff, cds.faa.tsv.ks.tsv and cds.faa.tsv file: file.zip

Could you please help identify the potential cause of this error and suggest any steps to resolve it?

Thank you.

heche-psb commented 3 months ago

Hi, gene ids at the "CDS" rows after "ID=" should be the same as the gene ids in your cds.faa.tsv file, which is apparently not the case in your dataset. In you gene family file cds.faa.tsv gene ids are like "WURV02015968.1:2079-3951" while in your gff3 file gene ids are like "g53714.t1.cds". You need to synchronize the gene ids in all your datafiles.

RohanNathHERE commented 3 months ago

Thank you for your help. I have successfully modified the ID and gene name in the GFF file, and it now looks like:

WURV02000002.1:6299-7889 AUGUSTUS CDS 6300 7889 0.19 - 0 ID=WURV02000002.1:6299-7889;Parent=g1.t1 WURV02000002.1:8334-8365 AUGUSTUS CDS 8335 8365 0.66 + 0 ID=WURV02000002.1:8334-8365;Parent=g2.t1 WURV02000002.1:8489-8929 AUGUSTUS CDS 8490 8929 0.22 + 2 ID=WURV02000002.1:8489-8929;Parent=g2.t1 WURV02000002.1:8968-9208 AUGUSTUS CDS 8969 9208 1 + 0 ID=WURV02000002.1:8968-9208;Parent=g2.t1 WURV02000002.1:52499-52774 AUGUSTUS CDS 52500 52774 0.77 - 2 ID=WURV02000002.1:52499-52774;Parent=g3.t1 WURV02000002.1:56763-56785 AUGUSTUS CDS 56764 56785 0.24 - 0 ID=WURV02000002.1:56763-56785;Parent=g3.t1 WURV02000003.1:11572-11785 AUGUSTUS CDS 11573 11785 1 + 0 ID=WURV02000003.1:11572-11785;Parent=g4.t1 WURV02000003.1:11806-12016 AUGUSTUS CDS 11807 12016 0.38 + 0 ID=WURV02000003.1:11806-12016;Parent=g5.t1 WURV02000006.1:19188-19659 AUGUSTUS CDS 19189 19659 0.18 + 0 ID=WURV02000006.1:19188-19659;Parent=g6.t1 WURV02000006.1:37479-38097 AUGUSTUS CDS 37480 38097 0.59 - 0 ID=WURV02000006.1:37479-38097;Parent=g7.t1

I also changed the gene name and added the coordinates to the name due to a duplicate gene error.

However, I’m encountering an issue after downloading, compiling, and successfully testing the i-adhore tool.

                INFO     Processing I-ADHoRe output                    cli.py:711

Traceback (most recent call last): File "/home/workspace2/anaconda3/envs/wgd/bin/wgd", line 10, in sys.exit(cli()) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/click/core.py ", line 829, in call return self.main(args, kwargs) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/click/core.py ", line 782, in main rv = self.invoke(ctx) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/click/core.py ", line 1259, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/click/core.py ", line 1066, in invoke return ctx.invoke(self.callback, ctx.params) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/click/core.py ", line 610, in invoke return callback(args, kwargs) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/cli.py", line 669, in syn _syn(kwargs) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/cli.py", line 713, in _syn anchors,orig_anchors = get_anchors(out_path) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/wgd/syn.py", line 194, in get_anchors else: anchors = pd.read_csv(os.path.join(out_path, "anchorpoints.txt"), sep="\t", index_col=0) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/pandas/io/par sers.py", line 688, in read_csv return _read(filepath_or_buffer, kwds) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/pandas/io/par sers.py", line 454, in _read parser = TextFileReader(fp_or_buf, kwds) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/pandas/io/par sers.py", line 948, in init self._make_engine(self.engine) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/pandas/io/par sers.py", line 1180, in _make_engine self._engine = CParserWrapper(self.f, self.options) File "/home/workspace2/anaconda3/envs/wgd/lib/python3.6/site-packages/pandas/io/par sers.py", line 2010, in init self._reader = parsers.TextReader(src, **kwds) File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.cin it File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setu p_parser_source FileNotFoundError: [Errno 2] No such file or directory: '/storage2/wgd/lophotrochozoa /epimenia_babai/wgd_syn/iadhore-out/anchorpoints.txt'

What could be the potential reason?

heche-psb commented 3 months ago

Hi, a quick question. Could you show me the full log of your command and the content of your wgd_syn folder?

RohanNathHERE commented 3 months ago

Hello,

I'm attaching my cds.faa.tsv, cds.clean.gff, cds.faa.tsv.ks.tsv files, along with the log.txt file and the wgd_syn directory, for your reference. Thank you for your patience. I greatly appreciate your assistance. NOTE: In the gff3 file, I changed the ID and the gene name in the first column (due to a duplicate gene name error). I believe it has something to do with the format of my files. Please let me know if you need any more files for your reference.

files.zip

heche-psb commented 3 months ago

Hi, the first column of cds.clean.gff should be the chromosome or scaffold IDs, rather than gene ids.

RohanNathHERE commented 3 months ago

Hello, thank you for your clarification. I'll work on it and get the proper format.

RohanNathHERE commented 3 months ago

I changed my gff file into this particular format

WURV02000002.1 AUGUSTUS CDS 6300 7889 0.19 - 0 ID=WURV02000002.1:6299-7889;Parent=g1.t1 WURV02000002.1 AUGUSTUS CDS 8335 8365 0.66 + 0 ID=WURV02000002.1:8334-8365;Parent=g2.t1 WURV02000002.1 AUGUSTUS CDS 8490 8929 0.22 + 2 ID=WURV02000002.1:8489-8929;Parent=g2.t1 WURV02000002.1 AUGUSTUS CDS 8969 9208 1 + 0 ID=WURV02000002.1:8968-9208;Parent=g2.t1 WURV02000002.1 AUGUSTUS CDS 52500 52774 0.77 - 2 ID=WURV02000002.1:52499-52774;Parent=g3.t1 WURV02000002.1 AUGUSTUS CDS 56764 56785 0.24 - 0 ID=WURV02000002.1:56763-56785;Parent=g3.t1 WURV02000003.1 AUGUSTUS CDS 11573 11785 1 + 0 ID=WURV02000003.1:11572-11785;Parent=g4.t1 WURV02000003.1 AUGUSTUS CDS 11807 12016 0.38 + 0 ID=WURV02000003.1:11806-12016;Parent=g5.t1 WURV02000006.1 AUGUSTUS CDS 19189 19659 0.18 + 0 ID=WURV02000006.1:19188-19659;Parent=g6.t1 WURV02000006.1 AUGUSTUS CDS 37480 38097 0.59 - 0 ID=WURV02000006.1:37479-38097;Parent=g7.t1

Yet I'm getting the following error:

2024-07-26 15:29:02 INFO This is wgd v2.0.38 cli.py:34 INFO Checking cores and threads... core.py:35 INFO The number of logical CPUs/Hyper Threading in the system: 80 core.py:36 INFO The number of physical cores in the system: 20 core.py:37 INFO The number of actually usable CPUs in the system: 80 core.py:38 INFO Checking memory... core.py:40 INFO Total physical memory: 251.5079 GB core.py:41 INFO Available memory: 215.7972 GB core.py:42 INFO Free memory: 52.1904 GB core.py:43 2024-07-26 15:29:03 INFO Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable. utils.py:144 INFO Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8. utils.py:147 INFO NumExpr defaulting to 8 threads. utils.py:159 INFO Configuring I-ADHoRe co-linearity search cli.py:703 INFO Writing families file syn.py:96 INFO Writing gene lists syn.py:98 2024-07-26 15:29:09 ERROR There are duplicated gene IDs for given feature and attribute syn.py:121

Does that mean I have duplicates in the 9th column of my gff file? Which duplicates should I remove?

heche-psb commented 3 months ago

Hi, each row with CDS and ID tag should have unique gene ids. Can you make sure it's the case in your data?

RohanNathHERE commented 3 months ago

Hello, I'm still trying to locate the duplicate gene IDs in my cds.clean.gff file. Does the duplication lie in the 9th column? I've tested multiple times but still can't find any duplicate rows. I even converted the gene IDs and replaced ":" and "-" to convert the IDs into the format WURV02000018.1_22566_22857 in both the gff and fasta file. Could you please suggest any workarounds for this problem? The cds.clean.gff file is attached to the zip file above.

heche-psb commented 3 months ago

Hi, another quick notation. You should not use all the raw CDS per gene for collinearity search, but use one main transcript per gene, same for constructing whole paranome and Ks distribution.

RohanNathHERE commented 3 months ago

Hi, thank you for the quick solution. I'll try it and keep you updated.

heche-psb commented 3 months ago

The direct reason that caused the error log 2024-07-26 15:29:09 ERROR There are duplicated gene IDs for given feature and attribute syn.py:121 is that you have duplicated gene ids in your gene family file, for instance "WURV02226005.1:440-1065", which indicates something wrong with your gene id renaming process.

RohanNathHERE commented 3 months ago

I understand, I was using bedtools getfasta to fetch the CDS from the genome. Maybe I should switch to gffread to find a solution. Thank you for your suggestion.

RohanNathHERE commented 3 months ago

Hi, I used gffread to solve the problem. However, I'm facing yet another issue. Could you please look into it?

Screenshot from 2024-07-26 18-22-44 Screenshot from 2024-07-26 18-23-18 Screenshot from 2024-07-26 18-23-58

What does remove_unused_levels mean?

My gff3 file is in this format:

Screenshot from 2024-07-26 18-24-23

heche-psb commented 3 months ago

If you're using short CDS for collinearity search, you may set the option --mingenenum as 1 and --minseglen as 1 (please refer to README for the explanation of these parameters).

RohanNathHERE commented 3 months ago

Thanks! I'll go through the manual carefully.

RohanNathHERE commented 3 months ago

Hello, I hope you're doing well. I've managed to run the entire pipeline. However, I noticed that a few of the plots are missing from the directories, like the dupstackplot and dotplot. These are the commands which I ran:

  1. wgd dmd cds.faa --eval 1e-5 --inflation 1.5 -nthreads 50 --keepfasta --cscore 0.7 --orthoinfer
  2. wgd ksd wgd_dmd/cds.faa.tsv cds.faa --cds --strip_gaps --outdir wgd_ksd -nthreads 8
  3. wgd syn --feature CDS --attribute ID wgd_dmd/cds.faa.tsv cds.clean.gff --ks_distribution wgd_ksd/cds.faa.tsv.ks.tsv --outdir wgd_syn --nthreads 50 --gistrb --showrealtick --minseglen 1 --mingenenum 1 --maxsize 10
  4. wgd viz --datafile wgd_ksd/cds.faa.tsv.ks.tsv --outdir wgd_ELMM --plotkde --plotsyn --segments wgd_syn/iadhore-out/segments.txt --maxsize 20 --anchorpoints wgd_syn/iadhore-out/anchorpoints.txt --minseglen 1 --mingenenum 1 --plotapgmm --plotelmm --showrealtick --adjustortho --classic --toparrow --gistrb --genetable wgd_syn/gene-table.csv --nthreads 50 --multiplicon wgd_syn/iadhore-out/multiplicons.txt

This is the content of my wgd_syn directory:

total 12M drwxrwxr-x 2 workspace2 workspace2 80K Jul 29 14:02 cds.faa.lists drwxrwxr-x 2 workspace2 workspace2 226 Jul 29 14:03 iadhore-out -rw-rw-r-- 1 workspace2 workspace2 1.1M Jul 29 16:36 families.tsv -rw-rw-r-- 1 workspace2 workspace2 226K Jul 29 16:36 iadhore.conf -rw-rw-r-- 1 workspace2 workspace2 7.6M Jul 29 16:38 cds.faa_gene_order_perchrom.tsv -rw-rw-r-- 1 workspace2 workspace2 2.6M Jul 29 16:38 gene-table.csv -rw-rw-r-- 1 workspace2 workspace2 6.6K Jul 29 16:38 anchors.csv -rw-rw-r-- 1 workspace2 workspace2 85 Jul 29 16:38 segments_coordinates.tsv

and this is the content of the iadhore-out directory:

total 828K -rw-rw-r-- 1 workspace2 workspace2 0 Jul 29 16:38 alignment.txt -rw-rw-r-- 1 workspace2 workspace2 763K Jul 29 16:38 genes.txt -rw-rw-r-- 1 workspace2 workspace2 5.4K Jul 29 16:38 multiplicons.txt -rw-rw-r-- 1 workspace2 workspace2 1.9K Jul 29 16:38 baseclusters.txt -rw-rw-r-- 1 workspace2 workspace2 8.1K Jul 29 16:38 segments.txt -rw-rw-r-- 1 workspace2 workspace2 8.5K Jul 29 16:38 multiplicon_pairs.txt -rw-rw-r-- 1 workspace2 workspace2 16K Jul 29 16:38 list_elements.txt -rw-rw-r-- 1 workspace2 workspace2 11K Jul 29 16:38 anchorpoints.txt

Am I missing any parameter? How can I get the dupstack plot and the dotplot? Please let me know and thank you for your support.

heche-psb commented 3 months ago

Hi, I suggest the parameter "--maxsize 10" stick to the default "--maxsize 200" because "--maxsize 10" means you only consider gene families whose size is smaller than or equal to 10.

RohanNathHERE commented 3 months ago

Thank you for your suggestions. All of the pipelines are now running smoothly after several rounds of iterations and trials. So, my basic project includes identifying the presence of WGD events in mollusca. These are the Ks plots I obtained.

cds.faa.tsv.ksd.pdf

As per my understanding:

  1. A peak in the KS distribution typically represents a burst of gene duplications that occurred at the same time. This is a strong indicator of a WGD event. The position of the peak along the KS axis can give an estimate of the timing of the WGD.
  2. After a WGD, some duplicated genes might acquire new functions or undergo different selection pressures. Peaks in the KA and ω distributions can indicate functional divergence and selection pressures acting on these duplicates post-WGD.

In case of my data:

  1. A peak at KS = 0.1 suggests a relatively recent WGD event, as many duplicates have low KS values indicating they have not diverged much.
  2. log10KS values being left-skewed and clustering around 0.5 suggests that the majority of gene duplications have relatively low KS values.
  3. log10KA values following a normal distribution with a mean around -0.5 suggest that the non-synonymous substitution rates are distributed around a central value, with most duplicates showing moderate divergence.
  4. log10ω values being normally distributed with a mean near -1 indicates that the ratio of KA to KS is generally low, suggesting purifying selection is dominant.

Are my inferences correct? Please let me know.

Thank you for your patience!

heche-psb commented 2 months ago

Hi, I suggest conducting ELMM analysis using wgd viz to discern prominent Ks peaks by WGDs before you delve any deeper. Besides, how does your anchor Ks distribution look like?

RohanNathHERE commented 2 months ago

Hello, I've conducted the ELMM analysis using wgd viz and examined the prominent Ks peaks and the anchor Ks distribution.

According to my inferences:

The analysis identified a distinct peak in the log-transformed Ks distribution. The model selection process was guided by minimizing the BIC.

I've attached both the wgd_ELMM and the wgd_peak files for your reference.

wgd_ELMM.zip

wgd_peak.zip

Do you have any recommendations on further refining the model or additional analyses to perform? Are these and my inference correct?