sjteresi / TE_Density

Python script calculating transposable element density for all genes in a genome. Publication: https://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-022-00264-4
GNU General Public License v3.0
28 stars 4 forks

interrupted run restart? #114

Closed davidaray closed 7 months ago

davidaray commented 1 year ago

We had an interruption in the system during day four of a long run. The core was dumped and the run was stopped. Last message was as follows:

genes : 98%|██████████████▋| 39249/40111 [93:17:02<32:16:39, 134.80s/it] process : 99%|███████████████████████▋| 77/78 [93:17:02<02:16, 136.41s/it]

genes : 98%|██████████████▋| 39250/40111 [93:19:17<32:13:52, 134.77s/it] process : 99%|███████████████████████▋| 77/78 [93:19:17<02:16, 136.41s/it]

genes : 98%|██████████████▋| 39251/40111 [93:21:33<32:16:57, 135.14s/it] process : 99%|███████████████████████▋| 77/78 [93:21:33<02:16, 136.41s/it]
/var/spool/slurmd/job6413482/slurm_script: line 43: 35083 Bus error (core dumped) python $PROGRAMDIR/process_genome.py $GENEDATA $TEDATA $GENOME -c $PROGRAMDIR/config/production_run_config.ini -n 36 -o $DIR

As you can see, it was nearly finished with whatever process it was performing.

Is it possible to restart this and have it pick up where the process left off, or am I stuck starting from the beginning again?

[Please say it's option one.]

sjteresi commented 1 year ago

Unfortunately, you need to restart the process.

sjteresi commented 1 year ago

The only time you save is that the revised TE set doesn't need to be re-made.

davidaray commented 1 year ago

I restarted the process and it ran for over five days this time. However, I checked the run this morning and it appears to have encountered an end-of-file error.

I can't find a clue as to what file may be causing the problem.

Sorry to be a bother again but can you help with this one?

David

genes       : 100%|██████████████████| 40111/40111 [125:32:18<00:00, 11.27s/it]
2022-08-02 06:46:02 cpu-19-11 __main__[641] INFO processed 78 overlap jobs
2022-08-02 06:46:02 cpu-19-11 __main__[641] INFO process overlap... complete
2022-08-02 06:46:02 cpu-19-11 __main__[641] INFO process density

subsets:   0%|                                          | 0/468 [00:00<?, ?it/s]
subsets:   0%|                                  | 1/468 [00:00<06:18,  1.23it/s]
...
subsets:  73%|███████████████████████▎        | 341/468 [01:52<21:54, 10.35s/it]
subsets:  73%|███████████████████████▍        | 342/468 [02:02<21:34, 10.27s/it]
subsets:  73%|███████████████████████▍        | 343/468 [02:07<17:57,  8.62s/it]
subsets:  74%|███████████████████████▌        | 344/468 [02:16<17:39,  8.55s/it]
subsets:  74%|███████████████████████▌        | 345/468 [02:40<27:05, 13.21s/it]
subsets:  74%|███████████████████████▋        | 346/468 [02:50<25:18, 12.44s/it]
Exception in thread Thread-11:
Traceback (most recent call last):
  File "/home/daray/conda/envs/tedensity/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/daray/conda/envs/tedensity/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/lustre/work/daray/software/TE_Density/process_genome.py", line 139, in exec
    result = self.queue.get(timeout=0.2)
  File "<string>", line 2, in get
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/managers.py", line 835, in _callmethod
    kind, result = conn.recv()
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
davidaray commented 1 year ago

I just noticed that the job is still running. Well, at least the queue says it is. But the output files have not been updated in several hours.

  6455949   xlquanah   98037   R     daray mMyo-dens  6-12:29:24      1     36 xlquanah  cpu-19-11
sjteresi commented 1 year ago

Hi David,

Thanks again for your patience, and I apologize that you are having these troubles; I will do my best to help! I must admit this is not an error I have seen before, so @teresi and I will begin investigating.

In the meantime, can you confirm that all of your package versions match requirements/requirements.txt and that you are able to run at least one of the example data sets? You can download the data from Dryad and run process_genome.py on one of those pairs of cleaned input data.

davidaray commented 1 year ago

What would be the most efficient way to confirm all of the packages in requirements.txt? I was able to run pip install -r requirements/requirements.txt and got no errors, but if there is a better way to confirm that everything was installed properly, please let me know.

In the meantime, I will try to run one of the shorter analyses to confirm everything is working properly. Unfortunately, our HPCC is going down for maintenance in a few days and no jobs longer than a couple of days are being accepted at the moment.

davidaray commented 1 year ago

This may help:

$ conda list
# packages in environment at /home/daray/conda/envs/tedensity:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
asttokens                 2.0.5              pyhd8ed1ab_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2022.6.15            ha878542_0    conda-forge
decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
distlib                   0.3.5                    pypi_0    pypi
executing                 0.8.3              pyhd8ed1ab_0    conda-forge
filelock                  3.7.1                    pypi_0    pypi
ipython                   8.4.0            py38h578d9bd_0    conda-forge
jedi                      0.18.1           py38h578d9bd_1    conda-forge
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.1.0              h8d9b700_16    conda-forge
libgomp                   12.1.0              h8d9b700_16    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libzlib                   1.2.12               h166bdaf_2    conda-forge
matplotlib-inline         0.1.3              pyhd8ed1ab_0    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
openssl                   3.0.5                h166bdaf_0    conda-forge
parso                     0.8.3              pyhd8ed1ab_0    conda-forge
pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pip                       22.1.2             pyhd8ed1ab_0    conda-forge
platformdirs              2.5.2                    pypi_0    pypi
prompt-toolkit            3.0.30             pyha770c72_0    conda-forge
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
pure_eval                 0.2.2              pyhd8ed1ab_0    conda-forge
pygments                  2.12.0             pyhd8ed1ab_0    conda-forge
python                    3.8.13          ha86cf86_0_cpython    conda-forge
python_abi                3.8                      2_cp38    conda-forge
readline                  8.1.2                h0f457ee_0    conda-forge
setuptools                63.2.0           py38h578d9bd_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.39.1               h4ff8645_0    conda-forge
stack_data                0.3.0              pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
traitlets                 5.3.0              pyhd8ed1ab_0    conda-forge
virtualenv                20.15.1                  pypi_0    pypi
wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.12               h166bdaf_2    conda-forge
davidaray commented 1 year ago

One more thing. Our HPCC is notorious for being a difficult system to work on. I've been in several arguments with the administration about these exact sorts of things. If I were to bet, it would not be your package that's at fault.

sjteresi commented 1 year ago

Hmm, odd. I don't see pandas or numpy in that conda list output. You should be able to check your packages in pip with pip freeze. Usually I do things through a virtualenv and have an environment that I activate. I am unfamiliar with using conda AND pip together, so I have no idea whether this could be causing an issue. This link from the Anaconda website seems to provide more information: https://www.anaconda.com/blog/using-pip-in-a-conda-environment
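To go beyond eyeballing pip freeze, the comparison against the pinned versions can be scripted. A minimal sketch, assuming Python >= 3.8 and exact `==` pins; the hardcoded pins dict below is a placeholder for values parsed out of requirements/requirements.txt:

```python
# Sketch: report whether installed package versions match a set of pins.
# In practice, parse the pins from requirements/requirements.txt instead
# of hardcoding them as done here.
from importlib.metadata import PackageNotFoundError, version

def check_pins(pins):
    """Return {package: "OK" | "MISMATCH (...)" | "MISSING"} for each pin."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
            report[name] = (
                "OK" if installed == wanted
                else f"MISMATCH (installed {installed})"
            )
        except PackageNotFoundError:
            report[name] = "MISSING"
    return report

if __name__ == "__main__":
    # placeholder pins for illustration
    for pkg, status in check_pins({"pip": "22.1.2", "numpy": "1.22.4"}).items():
        print(pkg, status)
```

A "MISSING" entry for a package that conda list also omits (as with pandas and numpy above) would confirm the environment is genuinely incomplete rather than just split between the two package managers.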

davidaray commented 1 year ago

The Arabidopsis run appears to have finished successfully but I'd like to confirm. What output files should exist?

Here's what I have:

$ ls -R aTha_tedensity
aTha_tedensity:
aTha_Chr1.h5  aTha_Chr3.h5  aTha_Chr5.h5         results
aTha_Chr2.h5  aTha_Chr4.h5  filtered_input_data  tmp

aTha_tedensity/filtered_input_data:
input_h5_cache  revised_input_data

aTha_tedensity/filtered_input_data/input_h5_cache:
aTha_Chr1_GeneData.h5  aTha_Chr3_GeneData.h5  aTha_Chr5_GeneData.h5
aTha_Chr1_TEData.h5    aTha_Chr3_TEData.h5    aTha_Chr5_TEData.h5
aTha_Chr2_GeneData.h5  aTha_Chr4_GeneData.h5
aTha_Chr2_TEData.h5    aTha_Chr4_TEData.h5

aTha_tedensity/filtered_input_data/revised_input_data:
Revised_Cleaned_TAIR10_chr_main_chromosomes.fas.mod.EDTA.TEanno.tsv
aTha_nameless_revision_cache.h5
aTha_order_revision_cache.h5
aTha_superfam_revision_cache.h5

aTha_tedensity/tmp:
overlap

aTha_tedensity/tmp/overlap:
aTha_Chr1_overlap.h5  aTha_Chr3_overlap.h5  aTha_Chr5_overlap.h5
aTha_Chr2_overlap.h5  aTha_Chr4_overlap.h5

The O. glab data are currently running as are the data for only chr13 of human. Obviously, those will take significantly longer.

sjteresi commented 1 year ago

The output files are all present. Looks like that worked without a hitch. The main output files you as a user care about are the final ones: aTha_Chr1.h5, aTha_Chr2.h5, etc.

davidaray commented 1 year ago

Interestingly, the O. glab run appears to have stalled. The output file hasn't updated since 2 pm yesterday and the tail end looks like this:

2022-08-02 14:13:30 cpu-25-38 __main__[69756] INFO processed 12 overlap jobs
2022-08-02 14:13:30 cpu-25-38 __main__[69756] INFO process overlap... complete
2022-08-02 14:13:30 cpu-25-38 __main__[69756] INFO process density

subsets:   0%|                                           | 0/72 [00:00<?, ?it/s]
subsets:   1%|▍                                  | 1/72 [00:17<21:09, 17.87s/it]
subsets:   3%|▉                                  | 2/72 [00:22<16:22, 14.04s/it]
...
subsets:  35%|███████████▊                      | 25/72 [06:19<16:27, 21.00s/it]
subsets:  36%|████████████▎                     | 26/72 [06:56<19:46, 25.80s/it]
subsets:  38%|████████████▊                     | 27/72 [07:25<20:04, 26.76s/it]
subsets:  39%|█████████████▏                    | 28/72 [09:09<36:36, 49.92s/it]

But, there are output files.

$ ls -R
.:
filtered_input_data  oGlab_1.h5  oGlab_10.h5  oGlab_11.h5  oGlab_12.h5  oGlab_2.h5  oGlab_3.h5  oGlab_4.h5  oGlab_5.h5  oGlab_6.h5  oGlab_7.h5  oGlab_8.h5  oGlab_9.h5  tmp

./filtered_input_data:
input_h5_cache  revised_input_data

./filtered_input_data/input_h5_cache:
oGlab_10_GeneData.h5  oGlab_11_TEData.h5    oGlab_1_GeneData.h5  oGlab_2_TEData.h5    oGlab_4_GeneData.h5  oGlab_5_TEData.h5    oGlab_7_GeneData.h5  oGlab_8_TEData.h5
oGlab_10_TEData.h5    oGlab_12_GeneData.h5  oGlab_1_TEData.h5    oGlab_3_GeneData.h5  oGlab_4_TEData.h5    oGlab_6_GeneData.h5  oGlab_7_TEData.h5    oGlab_9_GeneData.h5
oGlab_11_GeneData.h5  oGlab_12_TEData.h5    oGlab_2_GeneData.h5  oGlab_3_TEData.h5    oGlab_5_GeneData.h5  oGlab_6_TEData.h5    oGlab_8_GeneData.h5  oGlab_9_TEData.h5

./filtered_input_data/revised_input_data:
Revised_Cleaned_Oryza_Glaberrima_NewNames.fasta.mod.EDTA.TEanno.tsv  oGlab_nameless_revision_cache.h5  oGlab_order_revision_cache.h5  oGlab_superfam_revision_cache.h5

./tmp:
overlap

./tmp/overlap:
oGlab_10_overlap.h5  oGlab_12_overlap.h5  oGlab_2_overlap.h5  oGlab_4_overlap.h5  oGlab_6_overlap.h5  oGlab_8_overlap.h5
oGlab_11_overlap.h5  oGlab_1_overlap.h5   oGlab_3_overlap.h5  oGlab_5_overlap.h5  oGlab_7_overlap.h5  oGlab_9_overlap.h5

I'm not sure how to interpret this.

David

teresi commented 1 year ago

Hello David,

This may take further debugging so thank you for your patience.

First, to elucidate:

...it appears to have encountered an end-of-file error.

File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

As I understand it: multiprocessing/connection.py::_recv raises EOFError when the connection has closed and there is nothing further to receive.

see https://stackoverflow.com/questions/25994201/python-multiprocessing-queue-error
see https://github.com/python/cpython/blob/330f1d58282517bdf1f19577ab9317fa9810bf95/Lib/multiprocessing/connection.py#L378

I can't find a clue as to what file may be causing the problem.

The "file" it refers to is the internal resource the manager uses to communicate between processes, i.e. the "arbitrary file descriptor", not any of your files.
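The mechanism can be reproduced in a few lines, independent of TE_Density. A minimal sketch: recv() on a multiprocessing connection raises EOFError once the other end has closed and no buffered data remains, exactly the code path in multiprocessing/connection.py::_recv.

```python
# Minimal reproduction: EOFError from recv() on a closed connection.
from multiprocessing import Pipe

receiver, sender = Pipe(duplex=False)
sender.send("last item")
sender.close()                      # the writing end goes away

print(receiver.recv())              # "last item" was still buffered
try:
    receiver.recv()                 # nothing left: zero-byte read
except EOFError:
    print("EOFError: connection closed with nothing left to receive")
```

This is why the error points at library internals rather than at any of the input files: the "end of file" is the inter-process pipe itself going quiet.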

I'm not sure how to interpret [the output files from the stalled run].

The files may have been created but not populated. You can view the files in HDFView; entries should be missing if the run failed partway.

Second, my hypotheses:

Given the stackoverflow post, there may be a race condition where the Manager is being shut down while other references to the Queue are still in use. The context managers should have killed the reader prior to the Manager, so this is puzzling. Nonetheless, I am adding a task_done call to the Queue's reader as best practice, and a context manager call to the Manager itself. Please wait for further developments in an upcoming commit.

Given that your call hung, my best guess is that one of your processes crashed and then caused this behavior. As @sjteresi mentioned in other issues, running out of RAM has caused similar stalling behavior. When you observed your job, was it using any CPU resources? (It may be appropriate for us in a future commit to kill process_genome.py if this exception is encountered, or to add more exception handling to MergeData.sum.)

Finally, my recommendations:

I'd like to be able to reproduce the error.

As @sjteresi mentions, running both pip and conda may cause problems. Would you please create a new virtual environment using only pip? I'm currently on Python 3.8.0 and recommend virtualenvwrapper; see https://virtualenvwrapper.readthedocs.io/en/latest/command_ref.html
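A pip-only setup can also be done with the standard-library venv module (virtualenvwrapper wraps the same steps); directory names below are examples, not project conventions:

```shell
# Create and activate a fresh pip-only environment (paths are examples).
python3 -m venv "$HOME/venvs/tedensity"
source "$HOME/venvs/tedensity/bin/activate"
# pip install -r requirements/requirements.txt   # then install the pins
pip freeze   # confirm exactly what is installed, with no conda overlap
```

Because the environment starts empty, pip freeze afterwards lists only what the requirements file installed, which removes the conda/pip ambiguity from the conda list output above.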

Since you were able to run the Arabidopsis test, I'd like to try a portion of your dataset. Would you give us a one-chromosome subset of your data? A Google Drive link would be best; you can email it to us if you'd like to keep it private.

davidaray commented 1 year ago

Thanks Michael. This is all useful information. I'd be happy to pursue all of these options to solve the problem, but there are two obstacles. First, I'm out of the office at a conference until Friday and won't have much time to work on this. Second, even if I did, our HPCC is down for maintenance starting today and extending through Friday. So, the earliest I'll be able to send you the data subset or reinstall would be Monday of next week. But I'll take care of it then. Thanks again.

davidaray commented 1 year ago

Thanks for your patience. I got back from my conference over the weekend and generated the files you requested this morning. The link goes to a .tgz file containing the TE and gene data from one scaffold of our assembly.

https://drive.google.com/file/d/1nXU0-Bt5UcZjtWrPa02Ks5ieGpNFo303/view?usp=sharing

I will spend part of today reinstalling TE_Density with pip only and get back to you.

I look forward to seeing what happens when you try to run my data.

davidaray commented 1 year ago

I was able to reinstall using pip alone and got several processes moving. One of them was able to restart using the pre-filtered TE file and immediately jumped to one of the downstream processes.

...
2022-08-15 11:18:46 cpu-19-3 __main__[35958] INFO preprocessed 78 files to filtered_input_data/input_h5_cache
2022-08-15 11:18:46 cpu-19-3 __main__[35958] INFO preprocessing... complete
2022-08-15 11:18:46 cpu-19-3 __main__[35958] INFO process overlap...
2022-08-15 11:18:46 cpu-19-3 OverlapManager[35958] INFO output overlap data to /lustre/scratch/daray/tedensity/mMyo_tedensity2/tmp/overlap

process     : 0it [00:00, ?it/s]

genes       : 0it [00:00, ?it/s]
process     : 0it [00:00, ?it/s]

genes       : 0it [00:00, ?it/s]
2022-08-15 11:18:53 cpu-19-3 __main__[35958] INFO processed 78 overlap jobs
2022-08-15 11:18:53 cpu-19-3 __main__[35958] INFO process overlap... complete
2022-08-15 11:18:53 cpu-19-3 __main__[35958] INFO process density

subsets:   0%|                                          | 0/468 [00:00<?, ?it/s]
subsets:   0%|                                  | 1/468 [00:00<06:23,  1.22it/s]
subsets:   1%|▏                                 | 3/468 [00:00<04:39,  1.67it/s]
subsets:   1%|▎                                 | 4/468 [00:01<05:03,  1.53it/s]
...
subsets:  93%|█████████████████████████▏ | 437/468 [1:24:25<2:42:36, 314.72s/it]
subsets:  94%|█████████████████████████▎ | 438/468 [1:28:18<2:25:00, 290.02s/it]
subsets:  94%|█████████████████████████▎ | 439/468 [1:29:19<1:46:57, 221.30s/it]

Unfortunately, that last line is where it has been stuck since yesterday at 1 pm.

I recalled the earlier comment about RAM and stalling and investigated using top.

While the output of top is variable, I do see python coming up at times:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 1859 root      20   0       0      0      0 I   0.3   0.0   0:00.89 kworker/u72:3-ipoib_wq
10420 daray     20   0   51196   4304   3108 R   0.3   0.0   0:01.56 top
36196 daray     20   0 3481024 466148   1416 S   0.3   0.2   1:31.91 python

I don't know how to interpret this, though.

sjteresi commented 1 year ago

Hello David,

Thank you for the data. Michael and I will try to investigate it soon, in the meantime I am going to see if I can get your data working on MSU's HPCC.

Maybe @teresi can chime in regarding the output of top.

Regarding the program stalling: aside from giving the command more RAM, have you investigated condensing some TE groupings into one another?

davidaray commented 1 year ago

I haven't gone as far as to condense yet. I was hoping to get it to work without going through that process, which can be a pain. But, if necessary....

sjteresi commented 1 year ago

How many unique TE superfamilies do you have?

davidaray commented 1 year ago

112 for this species. 5S 5S-Core-RTE 5S-Deu-L2 5S-RTE 5S-Sauria-RTE Academ-1 Alu B2 B4 Bhikhari BovB CACTA CMC-EnSpm CR1 Copia Core-RTE Crypton Crypton-A Crypton-H Crypton-V DIRS DNA Dada Dong-R4 ERV ERV-Lenti ERV1 ERV1_internal ERV1_ltr ERV4 ERVK ERVK_internal ERVK_ltr ERVL ERVL-MaLR ERVL_internal ERVL_ltr Ginger Gypsy HAL Helitron I I-Jockey ID IS3EU Jockey Kolobok Kolobok-T2 L1 L1-Tx1 L2 LINE LTR MIR MULE-MuDR MULE-NOF Maverick Meg Merlin Mutator Ngaro Novosib P PIF-Harbinger PIF-ISL2EU Pao Penelope PiggyBac R1 R2 R2-Hero R2-NeSL RTE RTE-BovB RTE-RTE RTE-X Rex-Babar Rhin SINE Sola-1 SuperFamily TcMar TcMar-Fot1 TcMar-ISRm11 TcMar-Mariner TcMar-Stowaway TcMar-Tc1 TcMar-Tc2 TcMar-Tigger TcMariner U Ves Zisupton hAT hAT-Blackjack hAT-Charlie hAT-Tip100 piggyBac tRNA tRNA-CR1 tRNA-Core tRNA-Core-L2 tRNA-Core-RTE tRNA-Deu tRNA-Deu-L2 tRNA-L1 tRNA-L2 tRNA-RTE tRNA-V tRNA-V-CR1 tRNA-V-Core-L2 unknown

sjteresi commented 1 year ago

Wow, that is a lot. However, I agree; let's see if we can get it working with the whole dataset before we start talking about condensing groups.

sjteresi commented 1 year ago

However, if you do decide to condense groups, I can send you some updated Python code from another project where I condensed many of these TEs into similar groups (e.g., collapsing all of the tRNA categories into one). I had to filter a similar animal genome for that project; the code might at least make things easier for you.
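For illustration, a hedged sketch of what such condensing could look like with pandas; the column name `SuperFamily`, the regex rules, and the group names below are all assumptions for the example, not the actual code mentioned above:

```python
import pandas as pd

# Hypothetical condensing map: collapse fine-grained superfamily labels
# into broader groups. The patterns and group names are illustrative only;
# the real groupings should reflect your annotation's biology.
CONDENSE_MAP = {
    r"^tRNA.*": "tRNA",
    r"^TcMar.*": "TcMariner",
    r"^hAT.*": "hAT",
    r"^ERV.*": "ERV",
    r"^Crypton.*": "Crypton",
}

def condense_superfamilies(te_annotations: pd.DataFrame,
                           column: str = "SuperFamily") -> pd.DataFrame:
    """Collapse TE superfamily labels in `column` using regex rules."""
    condensed = te_annotations.copy()
    for pattern, group in CONDENSE_MAP.items():
        # replace any label matching the pattern with the broader group name
        condensed[column] = condensed[column].str.replace(
            pattern, group, regex=True
        )
    return condensed

# Toy annotation table to show the effect
tes = pd.DataFrame({"SuperFamily": ["tRNA-CR1", "TcMar-Tigger", "hAT-Charlie", "L1"]})
print(condense_superfamilies(tes)["SuperFamily"].tolist())
# ['tRNA', 'TcMariner', 'hAT', 'L1']
```

Fewer distinct superfamilies means fewer (gene x window x TE-group) density matrices to hold in memory, which is why condensing can help with the RAM pressure discussed here.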

davidaray commented 1 year ago

That would be helpful. This is a particularly complex mammal genome that contains quite a few TEs that other mammals do not have.

davidaray commented 1 year ago

Sorry for the long silence. Classes have started and I'm a newly minted associate chair of the department.

Progress has been made, sort of. I've had four runs going for over 11 days. These were all started after reducing the complexity of the TE files using methods similar to those in the Python script, which simplified them substantially. I had also found an error in my input files: some scaffolds were represented twice. For example, there would be two entries for every TE annotation on scaffold_1. I removed that problem before starting these runs as well.

The end result is similar, however. All of these have run for over 11 days and are generating output files, but those output files have not updated in 2-3 days. All of them are still shown as running in the queue, yet the most recent update to any of them was on the 4th of September at 8:13 pm.

At the end of those output files, I see this in one of them:

subsets:  49%|███████████████▌ | 143/294 [00:32<02:42,  1.08s/it]
subsets:  49%|███████████████▋ | 144/294 [00:32<02:26,  1.02it/s]
Exception in thread Thread-12:
Traceback (most recent call last):
  File "/home/daray/conda/envs/tedensity/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/daray/conda/envs/tedensity/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/pool.py", line 513, in _handle_workers
    cls._maintain_pool(ctx, Process, processes, pool, inqueue,
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/pool.py", line 337, in _maintain_pool
    Pool._repopulate_pool_static(ctx, Process, processes, pool,
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

subsets:  49%|███████████████▊ | 145/294 [00:57<20:05,  8.09s/it]
subsets:  50%|███████████████▉ | 146/294 [00:58<14:31,  5.89s/it]
subsets:  50%|████████████████ | 147/294 [01:00<12:05,  4.93s/it]
subsets:  50%|████████████████ | 148/294 [01:06<12:46,  5.25s/it]
subsets:  51%|████████████████▏ | 149/294 [01:07<09:29,  3.93s/it]
subsets:  51%|████████████████▎ | 150/294 [01:08<07:24,  3.09s/it]
subsets:  51%|████████████████▍ | 151/294 [01:15<09:43,  4.08s/it]
subsets:  52%|████████████████▌ | 152/294 [01:17<08:31,  3.60s/it]
subsets:  52%|████████████████▋ | 153/294 [01:18<06:16,  2.67s/it]
Exception in thread Thread-11:
Traceback (most recent call last):
  File "/home/daray/conda/envs/tedensity/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/daray/conda/envs/tedensity/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/lustre/work/daray/software/TE_Density/process_genome.py", line 139, in exec
    result = self.queue.get(timeout=0.2)
  File "", line 2, in get
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/managers.py", line 835, in _callmethod
    kind, result = conn.recv()
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/daray/conda/envs/tedensity/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Others have no errors or different errors.

In all cases, however, there appears to be a full complement of .h5 files. Overall, it's very confusing.

sjteresi commented 1 year ago

From that OSError, it seems the process ran out of memory. How much memory are you using per pseudomolecule?

Referencing our email discussion about the large fish genome I am running things on: I had to allocate a lot of RAM (60 GB) per pseudomolecule, and it ran within 48 hours. In my experience it is sometimes better to wait longer in the computing cluster's scheduling queue and ask for a lot of resources.
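For reference, a sketch of how one might request more memory for such a job under SLURM. The resource values are assumptions based on the 60 GB figure above, and the environment variables mirror the submission script quoted earlier in this thread; adjust everything to your cluster:

```shell
#!/bin/bash
#SBATCH --job-name=te_density
#SBATCH --time=48:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8      # fewer workers can lower peak RAM
#SBATCH --mem=60G              # per-node memory; raise this if OOM errors persist

# $PROGRAMDIR, $GENEDATA, $TEDATA, $GENOME, and $DIR as in the
# submission script quoted earlier in this thread
python "$PROGRAMDIR/process_genome.py" "$GENEDATA" "$TEDATA" "$GENOME" \
    -c "$PROGRAMDIR/config/production_run_config.ini" \
    -n "$SLURM_CPUS_PER_TASK" \
    -o "$DIR"
```

Matching `-n` to `--cpus-per-task` keeps the worker count consistent with what the scheduler actually grants.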

I am sorry that there continue to be issues. If we are still unable to get this working after this latest iteration of runs, I could always try to run your data on MSU's system. I don't think it would be difficult at all to port over.

sjteresi commented 1 year ago

Hello @davidaray

I apologize for not checking in sooner, late September and all of October was pretty crazy; do you have any updates? Were you able to get things working?

davidaray commented 1 year ago

I've also been very busy. I recently took on the position of associate chair of our department and that has been much more time consuming than I anticipated.

I do not have an update. Further, our HPCC is currently down for maintenance, so I can't even check to see what the results of the last run were.

I hope to get back to this during the holidays.

Thanks for checking in. I definitely want to get it to work.

David


sjteresi commented 1 year ago

No problem, David. I am no stranger to the trials and tribulations of running things on the computing cluster. Don't hesitate to ask if you need help.

Michael and I are still working on addressing memory efficiency and updating the software's packages. Hopefully more on that in the next few months...