rejuvyesh / PyCallChainRules.jl

Differentiate python calls from Julia
MIT License

Randomly occurring segmentation fault #29

Open RomeoV opened 1 year ago

RomeoV commented 1 year ago

Hello, thanks for the great work! I'm currently trying to wrap a PyTorch model in a Flux-based training setup. Training runs fine for a few epochs, but then, seemingly at random, a segmentation fault occurs (see below). I don't have a good MWE yet (I'll still try to make one), but perhaps we can already draw some conclusions from the stacktrace, which in this case appeared after about seven epochs:

[56770] signal (11.1): Segmentation fault
in expression starting at /home/romeo/Documents/Stanford/google_ood/DisentanglingVAE.jl/scripts/vae_CUB.jl:213
PyErr_Occurred at /usr/lib/libpython3.10.so.1.0 (unknown line)
pyerr_occurred at /home/romeo/.julia/packages/PyCall/twYvK/src/exception.jl:69 [inlined]
pyerr_check at /home/romeo/.julia/packages/PyCall/twYvK/src/exception.jl:75 [inlined]
############# LOOK HERE vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
share at /home/romeo/.julia/packages/DLPack/SUhao/src/pycall.jl:109
#13 at /home/romeo/.julia/packages/PyCallChainRules/YR5iR/src/pytorch.jl:59
#########################^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
unknown function (ip: 0x7ff3e1725d52)
map at ./tuple.jl:292
unknown function (ip: 0x7ff3e1723e23)
_jl_invoke at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/gf.c:2681 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/gf.c:2863
#rrule#12 at /home/romeo/.julia/packages/PyCallChainRules/YR5iR/src/pytorch.jl:59
rrule at /home/romeo/.julia/packages/PyCallChainRules/YR5iR/src/pytorch.jl:56 [inlined]
rrule at /home/romeo/.julia/packages/ChainRulesCore/a4mIA/src/rules.jl:134 [inlined]
chain_rrule at /home/romeo/.julia/packages/Zygote/xGkZ5/src/compiler/chainrules.jl:218 [inlined]
macro expansion at /home/romeo/.julia/packages/Zygote/xGkZ5/src/compiler/interface2.jl:0 [inlined]
_pullback at /home/romeo/.julia/packages/Zygote/xGkZ5/src/compiler/interface2.jl:9
unknown function (ip: 0x7ff3e1723a4d)
_jl_invoke at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/gf.c:2681 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-7/julialang/julia-release-1-dot-9/src/gf.c:2863
_pullback at /home/romeo/Documents/Stanford/google_ood/DisentanglingVAE.jl/scripts/vae_CUB.jl:166 [inlined]

Here are the referenced code snippets in the stacktrace: https://github.com/rejuvyesh/PyCallChainRules.jl/blob/1723781d955c2f0df479df1e2f9e983a377865fb/src/pytorch.jl#L56-L64 and https://github.com/pabloferz/DLPack.jl/blob/61f48ee6b5e4f56d9b8525fa6ef9b613242160b8/src/pycall.jl#L98-L116

RomeoV commented 1 year ago

Here is a GitHub gist that reproduces the error: https://gist.github.com/RomeoV/ca397a6b883c1cf567f2503d135084d8

The setup is loosely based on the VAE tutorial in the FastAI docs.

rejuvyesh commented 1 year ago

Given that DLPack and garbage collection are involved, this could very well be related to #24 (the interaction between the Julia GC and the Python GC). What versions of PyTorch/functorch are you using? Would it be possible to check your interaction with the PyNNTraining implementation: https://github.com/lorenzoh/PyNNTraining.jl/blob/e02bf899ce7228090a60286b8373fb87bfa5b6b1/src/topytorch.jl#L34