transientskp / tkp

A transients-discovery pipeline for astronomical image-based surveys
http://docs.transientskp.org/
BSD 2-Clause "Simplified" License

TraP Crashing Consistently on an AARTFAAC Image #486

Closed ycendes closed 8 years ago

ycendes commented 9 years ago

Earlier this week, I was running two sidebands of ~37k AARTFAAC images through TraP. One of these ran through just fine, but the other is consistently crashing at the same image, as can be seen in the database here- http://banana.transientskp.org/master/vlo_SB002cendes/dataset/1/

At first we thought this was because vlo was crashing, but it has now been cleared out and has plenty of space, and this happens even when I set up a new database and job from scratch. Further, the image where this happens appears pretty normal. It's at struis:/scratch/fhuizing/aartfaac/results/24h/F5.71304e+07_S1-63_T20-11-2013_14-39-23.image or struis:/scratch/fhuizing/aartfaac/results/24h/F5.71304e+07_S1-63_T20-11-2013_14-39-24.image, where we see there is an RFI source in the image that turns off, but nothing else really weird appears.

Further, there is nothing weird in the trap.log file when you look at it, even for the new one where everything was set up properly- it just stops. That log file etc is available here: struis:/scratch/ycendes/aartfaac/cendes-pipeline/SB003/SB002cendes/logs/2015-10-28T17:47:41

I don't have the printout of the screen from when it crashes as I was running it overnight, but I am now running this batch of data in serial mode and will post what happens here once it finishes this weekend (it will take some hours). I wanted to file this issue now while it was still fresh in my mind.

ycendes commented 9 years ago

Update, I ran this data set in serial over the weekend and it crashed before the sourcefinder step- http://banana.transientskp.org/master/vlo_SB002cendes/dataset/2/

Checking out the log, it appears the failure was during the quality control step, which if true is strange because there isn't supposed to be QC for AARTFAAC images. Logs are here (along with an output.txt of the stuff printed in the terminal window)- /scratch/ycendes/aartfaac/cendes-pipeline/SB003/SB002cendes/logs/2015-10-30T15:02:56

AntoniaR commented 9 years ago

There's no output.txt in that folder... Where did you put it?

ycendes commented 9 years ago

Sorry, I meant log.txt was the output file.

AntoniaR commented 9 years ago

Did you also look at the output on the terminal itself? If you capture the screen output with trap-manage.py run test > log.txt it can miss some of the error messages.

jdswinbank commented 9 years ago

You shouldn't miss any output if you capture both stdout and stderr:

$ trap-manage.py run test > log.txt 2>&1
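
The same redirection can be done from Python when a run is launched by a script rather than typed into a shell. This is only a minimal sketch: `run_and_capture` is a hypothetical helper, not part of tkp, and the command list is whatever you would have typed on the command line.

```python
import subprocess
import sys


def run_and_capture(cmd, log_path):
    """Run cmd and write both stdout and stderr to log_path.

    Returns the exit code of the command.
    """
    with open(log_path, "wb") as log:
        # stderr=subprocess.STDOUT merges stderr into stdout,
        # which is what the shell's 2>&1 does.
        return subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    # Stand-in command that writes to both streams; for a real run the
    # list would be something like ["trap-manage.py", "run", "test"].
    demo = [sys.executable, "-c",
            "import sys; print('to stdout'); sys.stderr.write('to stderr\\n')"]
    print(run_and_capture(demo, "log.txt"))
```

Because tracebacks are written to stderr, a log captured this way should also contain the final error message.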

ycendes commented 9 years ago

Ok, thanks! I'll set it up again to run with that command, as I don't have the output on the terminal (it takes several hours to get that far, and doesn't seem to show up at the end in screen).

ycendes commented 9 years ago

Ok! So, this time I was more successful, this run (http://banana.transientskp.org/master/vlo_SB002cendes/dataset/3/) looks the same as the earlier ones on banana and the crash is this:

/scratch/fhuizing/aartfaac/results/24h/SB002/F5.71304e+07_S1-63_T21-11-2013_00-04-59.image
*************************
Traceback (most recent call last):
  File "/home/ycendes/aartfaacenv/bin/trap-manage.py", line 5, in <module>
    pkg_resources.run_script('tkp==2.2a0', 'trap-manage.py')
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 528, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1394, in run_script
    execfile(script_filename, namespace, namespace)
  File "/home/ycendes/aartfaacenv/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/EGG-INFO/scripts/trap-manage.py", line 10, in <module>
    tkp.management.main()
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/management.py", line 323, in main
    args.func(args)
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/management.py", line 223, in run_job
    run(args.name, monitor_coords)
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/main.py", line 138, in run
    extraction_results = runner.map("extract_sources", urls, arguments)
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/distribute/__init__.py", line 42, in map
    return self.module.map(func, iterable, args)
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/distribute/serial/__init__.py", line 3, in map
    x = [func(i, *arguments) for i in iterable]
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/distribute/serial/tasks.py", line 22, in extract_sources
    return tkp.steps.source_extraction.extract_sources(url, extraction_params)
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/steps/source_extraction.py", line 50, in extract_sources
    force_beam=extraction_params['force_beam']
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/sourcefinder/image.py", line 401, in extract
    labelled_data=labelled_data, labels=labels
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/sourcefinder/image.py", line 863, in _pyse
    det = extract.Detection(measurement, self, chunk=island.chunk)
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/sourcefinder/extract.py", line 754, in __init__
    self._physical_coordinates()
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/sourcefinder/extract.py", line 820, in _physical_coordinates
    [self.x.value, self.y.value])]
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/utility/coordinates.py", line 668, in p2s
    raise RuntimeError("Spatial position is not a number")
RuntimeError: Spatial position is not a number

(Specifically, the logs etc are here, including this output in log.txt: /scratch/ycendes/aartfaac/cendes-pipeline/SB003/SB002cendes/logs/2015-11-02T16:39:30/ )

... I should note btw that the last image listed here is actually the last image in the data set, and definitely not the image that's the last one in the light curve. So it's not obvious to me why this error pops up when it does?

AntoniaR commented 9 years ago

So this looks like there is something up with the WCS in that image or with how the sourcefinder is interpreting the WCS conversion.

Have you tried loading that image in DS9? Does it look normal? When you move the mouse over the image, are the positions sensible? Does the image header look normal?

Try running PySE on that image with the same settings as in TraP - do all the source extractions look normal?
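
For readers unfamiliar with the error message: it means the pixel-to-sky conversion produced NaN for at least one coordinate. The guard below is a minimal sketch of that kind of check, written only from the error text in the traceback. `require_finite_position` is a hypothetical helper and is not tkp's actual `p2s` implementation.

```python
import math


def require_finite_position(ra, dec):
    """Return (ra, dec) unchanged, or raise if either value is NaN."""
    if math.isnan(ra) or math.isnan(dec):
        # Same message as in the traceback; the real check lives in
        # tkp/utility/coordinates.py and may differ in detail.
        raise RuntimeError("Spatial position is not a number")
    return ra, dec


if __name__ == "__main__":
    print(require_finite_position(123.4, 56.7))
    try:
        require_finite_position(float("nan"), 56.7)
    except RuntimeError as exc:
        print("rejected:", exc)
```

A position can fail this kind of check even when the image looks normal in a viewer, for example if a fitted source position lands on a pixel for which the WCS has no valid sky coordinate.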

ycendes commented 8 years ago

I just ran this again over the weekend, but unfortunately it still crashed at the same spot: http://banana.transientskp.org/master/vlo_SB002cendes/dataset/4/

I unfortunately forgot to get the log, as I was only running it to confirm it's the same error. I will do that again now to double check, but if it's crashing at the same spot I'm guessing Gijs's update didn't fix it.

gijzelaerr commented 8 years ago

Which version are you using, and from where? Are you sure you are using the latest checkout from master?

ycendes commented 8 years ago

Wait, maybe not.

Tell me, what's the point of always having to update the latest in your virtual environment yourself instead of just running the latest automatically? Because I keep forgetting!! :-/

gijzelaerr commented 8 years ago

so you have maximum control over which version/branch exactly you want to run. the nightly build in /soft is only updated once a day.

ycendes commented 8 years ago

Wait, so perhaps it was the right version then (as I imagine you updated it before Friday). How do I double check? Because trap-manage.py -h and trap-manage.py -version don't help on that, and it's not in the logs anywhere that I can see.

gijzelaerr commented 8 years ago

For example, by running git log in the tkp checkout in your home folder you can see a list of the latest commits and check whether the accepted PR is in there.

ycendes commented 8 years ago

I'm afraid I don't follow you. Where do I type git log to get that log? I've tried the home folder and a few other places but always get this error,

fatal: Not a git repository (or any parent up to mount point /home) Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

gijzelaerr commented 8 years ago

inside the folder where you checked out tkp from github, probably named tkp inside your homefolder.

ycendes commented 8 years ago

Oh, ok thanks. This is the last update:

commit c426fd42ae74bfd950cb96c10fca1accd31c12e4
Merge: 8c26891 d75e377
Author: F. Huizinga <folkerthuizinga@gmail.com>
Date:   Tue Oct 20 15:39:04 2015 +0200

    Merge pull request #1 from Error323/NaN-rejectreason

    Add new rejectreason

Mind, I would still seriously prefer it if there was a way to just automatically have the latest daily build as the default, and then play around with the version I have as needed. As a user rather than a developer, I don't really need maximum control over which version I have; I just need "the latest." It works much better for me. Is that possible?

gijzelaerr commented 8 years ago

just run git checkout master and git pull before you run the pipeline.

Otherwise see above, use the old method you have been using before (I think), the nightly from /soft.

Error323 commented 8 years ago

She also needs to run python setup.py install right?

gijzelaerr commented 8 years ago

no. She installed TKP in her virtualenv in developer mode (pip install -e .) which means it is not copied but symlinked. If you do a checkout of a specific branch it will automatically use that version in the virtualenv.
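
One way to double check which copy of a package a virtualenv actually imports is to look at the module's `__file__` attribute. The snippet below is a minimal sketch, assuming the package is importable in the active environment; `module_location` is a hypothetical helper, not part of tkp.

```python
import importlib
import os


def module_location(name):
    """Return the directory from which the named package is imported."""
    module = importlib.import_module(name)
    return os.path.dirname(os.path.abspath(module.__file__))


if __name__ == "__main__":
    # "json" is only a stand-in from the standard library; in the
    # pipeline's virtualenv you would pass "tkp" instead.
    print(module_location("json"))
```

If the reported path points into a git checkout, the environment is using the development copy. If it points into site-packages (for example an .egg directory), an installed copy is being used and a git pull alone will not change what runs.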

ycendes commented 8 years ago

Good to know. Anyway, should be updated now, and it's running again.

Error323 commented 8 years ago

You so fancy @gijzelaerr !

ycendes commented 8 years ago

I'm afraid it still failed, same image and reason. And yes, I did "git checkout master" and "git pull" before running it.

http://banana.transientskp.org/master/vlo_SB002cendes/dataset/5/

(aartfaacenv)ycendes@struis:/scratch/ycendes/aartfaac/cendes-pipeline/SB003$ tail log2.txt
    extraction_results = runner.map("extract_sources", urls, arguments)
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/distribute/__init__.py", line 42, in map
    return self.module.map(func, iterable, args)
  File "/home/ycendes/aartfaacenv/local/lib/python2.7/site-packages/tkp-2.2a0-py2.7.egg/tkp/distribute/multiproc/__init__.py", line 23, in map
    return pool.map(func, zipped)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
RuntimeError: Spatial position is not a number

gijzelaerr commented 8 years ago

Can you run it in serial (can probably be done with the problematic image only) so we are absolutely sure you are using the right version.

ycendes commented 8 years ago

Good news, @Error323 did something to my system (he can confirm just what that was- something relating to Python versions) and it now works!

http://banana.transientskp.org/master/vlo_SB002cendes/dataset/6/

Note, nothing to do with this specific issue but we still get the crazy number of sources because SB002 has not gotten the new visibilities treatment. (Incidentally Folkert, if you have a moment to do it to SB002 that would be great.)

hsuyeep commented 8 years ago

I'm getting the same error on a similar kind of image, even though I am using the latest AARTFAAC TraP (nightly build with tag 0d1a771). Looking at the source, it looks like the RuntimeError is caught with a warning everywhere except at the call to pyse in line 402 of struis:soft/trap/aartfaac/lib/python2.7/site-packages/tkp/sourcefinder/image.py

INFO:tkp.steps.source_extraction:Extracting image: /scratch/peeyush/pipeline/testbeammodel/SB003_beammodelapplied_10secinteg/F5.73257e+07_S1-63_2013-11-20T13-51-51.340.image
WARNING:tkp.sourcefinder.extract:Physical coordinates failed at 1090.109361, -138.185746
WARNING:tkp.sourcefinder.image:Island not processed; unphysical?
Traceback (most recent call last):
  File "/soft/trap/aartfaac/bin/trap-manage.py", line 10, in <module>
    tkp.management.main()
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/management.py", line 335, in main
    args.func(args)
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/management.py", line 229, in run_job
    run(args.name, monitor_coords)
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/main.py", line 144, in run
    extraction_results = runner.map("extract_sources", urls, arguments)
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/distribute/__init__.py", line 42, in map
    return self.module.map(func, iterable, args)
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/distribute/serial/__init__.py", line 3, in map
    x = [func(i, *arguments) for i in iterable]
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/distribute/serial/tasks.py", line 22, in extract_sources
    return tkp.steps.source_extraction.extract_sources(url, extraction_params)
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/steps/source_extraction.py", line 50, in extract_sources
    force_beam=extraction_params['force_beam']
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/sourcefinder/image.py", line 402, in extract
    labelled_data=labelled_data, labels=labels
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/sourcefinder/image.py", line 864, in _pyse
    det = extract.Detection(measurement, self, chunk=island.chunk)
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/sourcefinder/extract.py", line 736, in __init__
    self._physical_coordinates()
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/sourcefinder/extract.py", line 802, in _physical_coordinates
    [self.x.value, self.y.value])]
  File "/soft/trap/aartfaac/lib/python2.7/site-packages/tkp/utility/coordinates.py", line 661, in p2s
    raise RuntimeError("Spatial position is not a number")
RuntimeError: Spatial position is not a number
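
The pattern described at the top of this comment, catching the error per island and logging a warning instead of letting it abort the whole run, can be sketched as follows. This is an illustration only: `build_detections` and `make_detection` are hypothetical names, and this is not the code in tkp/sourcefinder/image.py.

```python
import logging

logger = logging.getLogger(__name__)


def build_detections(islands, make_detection):
    """Build a detection for each island, skipping those that fail.

    make_detection is any callable that may raise RuntimeError, for
    example when the pixel-to-sky conversion yields NaN.
    """
    detections = []
    for island in islands:
        try:
            detections.append(make_detection(island))
        except RuntimeError as exc:
            # Log and move on so one bad island does not stop the image.
            logger.warning("Island not processed; unphysical? (%s)", exc)
    return detections
```

With a guard like this at every call site, a single unphysical island would be reported in the log and the remaining sources in the image would still be extracted.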

hsuyeep commented 8 years ago

Note that I asked Folkert about the magic he did based on YC's comment above, and it was a python setup.py install. Maybe it is better that I also use the latest TraP pull in my virtualenv, as the nightly build is not working out of the box for me...

gijzelaerr commented 8 years ago

You are not using a nightly, but the aartfaac branch. In your copy-paste you can see it uses Python modules from /soft/trap/aartfaac/.