rhoens opened this issue 7 years ago
I am also getting this error with CNTK Library Managed API from C#. I am creating batches and running evaluation on them (using GPU device), and depending on batch size I get this error randomly at some point during evaluation.
Which CNTK version are you using? Would it be possible to share a repro for further investigation?
Model was trained using CNTK-2-0-beta8-0-Windows-64bit-GPU-1bit-SGD and evaluation is done using the latest NuGet package for the CNTK Library Managed API. Sure, I've attached the related piece of code.
Here is the full exception stack:
Microsoft::MSR::CNTK::GPUMatrix<ElemType>::Resize: Cannot resize the matrix because it is a view.
at CNTK.Function.Evaluate(IDictionary`2 arguments, Dictionary`2 outputs, DeviceDescriptor computeDevice)
ReproCode.txt
@markorakita @rhoens We have not found any issue in your code. We suspect it could be a bug in CNTK. Would it be possible to share a repro for further investigation? Thanks.
@zhouwangzw I've sent you an email containing repro code + trained model + dataset. I've narrowed down what causes the exception in my case, I am calling evaluate with 16 items of size 32x32x3, but sometimes when I am at the end of dataset I call it with for example 3 items in a batch, and that causes exception to appear. Seems like bug in CNTK.
The two options are:
1) Something is maintaining a reference it shouldn't/wasn't expected to.
2) Someone is using Resize instead of RequireSize.
This might be a weird interaction with the python API + Math back end.
On Tue, Feb 7, 2017 at 9:43 AM, markorakita notifications@github.com wrote:
-- T. Ryan
To those who are still seeing this: are you always sending the same minibatch size to evaluate? We found that ours works again if all the minibatch sizes are the same (1 is what we set it to in our case).
I really have to ask, what's the point of testing batched evaluation with batch size of 1? :)
As I said in my previous post: "I am calling evaluate with 16 items of size 32x32x3, but sometimes when I am at the end of dataset I call it with for example 3 items in a batch, and that causes exception to appear". In other words, evaluation throws an exception when you call it with a batch size of 16 and right after that with a batch size of 3.
I am getting the same error. I cannot always use the same batch size, and using a size of only one does not really make sense. Do we all agree this is a bug? Is it going to be fixed?
I have the same problem. It occurs when you are processing batches of a certain size and then change the size to accommodate the remaining images at the end.
In my little test dataset I have 13 images (a prime number). Referring to the CNTK C# example CNTKLibraryCSEvalGPUExamples and its EvaluationBatchOfImages processing: if I load up all 13 images at my equivalent of the seqData.AddRange(resizedCHW) line, the modelFunc.Evaluate(inputDataMap, outputDataMap, device) line works fine. However, if I load up 5 images and evaluate, another 5 and evaluate (fine so far), and then the last 3 images, it generates "RuntimeError: Resize: Cannot resize the matrix because it is a view." at the evaluate. Similarly for 6x2 images and then the final image.
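The failure pattern above (fixed-size batches with a smaller remainder at the end) falls out of any chunking loop whose batch size does not divide the dataset length. A plain-Python sketch of the batching arithmetic, independent of CNTK:

```python
def batches(items, batch_size):
    """Yield consecutive chunks of `items`; the last chunk may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

images = list(range(13))  # 13 images, as in the example above
print([len(b) for b in batches(images, 5)])  # [5, 5, 3]: the 3-item tail batch triggers the bug
print([len(b) for b in batches(images, 6)])  # [6, 6, 1]: the "6x2 images and then the final image" case
```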
Thanks for reporting the issue. We are currently investigating it.
My workaround is to pad out the batch with dummy images. E.g. if previous runs loaded 100 images at a time and you only have 25 left in the final run, I pad the batch out to 100 images by re-using the last image and then ignore the outputs of the 75 dummies. It's a waste of computing resources, but it works fine.
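That padding workaround is easy to express generically. A minimal plain-Python sketch (no CNTK involved; `pad_batch` and the names below are illustrative), assuming the outputs come back in the same order as the inputs:

```python
def pad_batch(batch, target_size):
    """Pad a short batch to target_size by repeating its last item.

    Returns the padded batch plus the count of real (non-dummy) items,
    so the caller can slice the outputs back down afterwards."""
    real = len(batch)
    return batch + [batch[-1]] * (target_size - real), real

final_batch = ["img_101", "img_102", "img_103"]      # only 3 images left
padded, real = pad_batch(final_batch, 5)             # always evaluate 5 at a time
outputs = ["score_for_" + name for name in padded]   # stand-in for the Evaluate() call
outputs = outputs[:real]                             # discard the dummy results
```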
Ditto. My test data are thousands of speech sequences, and I use the HTK feature reader in the Python API. Because I need the model output for each sentence, it would be convenient to send in a precise number of frames per minibatch to take care of variable-length sentences. I got the same exception when I changed mbsize in reader.next_minibatch().
The bug is fixed in 2.0RC2.
I've encountered the same error on {2.0rc2, gpu, lstm, adam}. It is also intermittent in my case.
mb = reader.next_minibatch(minibatch_size * avg_seq_len, input_map=input_map)
while len(mb) > 0:
    trainer.train_minibatch(mb)
    mb = reader.next_minibatch(minibatch_size * avg_seq_len, input_map=input_map)
Hello everyone, it seems that when we use a variable-length minibatch size, it should be cast to an integer, for example reader.next_minibatch(int(minibatch_size * avg_seq_len), input_map=input_map). You can try it. On 06/10/2017 22:40, Aayush Garg wrote: I have encountered this error as well on the 2.0 release.
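The cast matters because the product above is a float whenever avg_seq_len is, while the reader expects an integer sample count. A quick plain-Python check (the values are made up for illustration):

```python
minibatch_size = 64
avg_seq_len = 17.5           # an average sequence length is rarely a whole number
requested = minibatch_size * avg_seq_len
print(type(requested))       # <class 'float'>: not a valid sample count as-is
requested = int(minibatch_size * avg_seq_len)
print(requested)             # 1120
```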
It seems that only the bug causing the "resize" error during Evaluation has been fixed since 2.0RC2, and there is another bug causing the same intermittent error when using next_minibatch. We are investigating it.
The bug causing the "resize" error when using get_next_minibatch() has now been fixed in master and will be included in the next binary release. It affects only Python.
I am closing the issue; feel free to reopen it if needed.
Unfortunately I am having this problem again. I use the most recent version of the GPU package from nuget.
I configure the minibatchSource to only give one epoch of the data:
config.SetMaxSweeps(1);
Then, when I try to get minibatch number X with
minibatchSource.GetNextMinibatch(128, device);
I get the following exception:
Microsoft::MSR::CNTK::GPUSparseMatrix<ElemType>::Resize: Cannot resize the matrix because it is a view.
[CALL STACK]
> Microsoft::MSR::CNTK::GPUSparseMatrix<float>:: Resize
- Microsoft::MSR::CNTK::GPUSparseMatrix<float>:: RequireSizeAndAllocate
- Microsoft::MSR::CNTK::GPUSparseMatrix<float>:: SetMatrixFromCSCFormat
- Microsoft::MSR::CNTK::Matrix<float>:: SetMatrixFromCSCFormat
- Microsoft::MSR::CNTK::DataTransferer:: operator=
- Microsoft::MSR::CNTK::Matrix<float>:: __autoclassinit2
- Microsoft::MSR::CNTK::DataTransferer:: operator= (x4)
- Microsoft::MSR::CNTK::IDataReader:: operator= (x2)
- Concurrency::details::_ContextCallback:: _CallInContext
- RtlSetThreadWorkOnBehalfTicket (x2)
- BaseThreadInitThunk
I suppose minibatch X is the last minibatch I would get from the minibatch source. Sometimes the exception is not thrown (I could retrieve all minibatches, including the last one with fewer samples), but I could not figure out why.
No exception is ever thrown when the minibatch size is 1. It does not matter whether I choose to use a CPU device instead.
Hello, I'm also seeing this problem intermittently. I'm using the latest version and training through Python. Just for comparison, I get the following error message on a get_next_minibatch call:
RuntimeError: Microsoft::MSR::CNTK::GPUMatrix<ElemType>::Resize: Cannot resize the matrix because it is a view.
[CALL STACK]
> Microsoft::MSR::CNTK::GPUMatrix<float>:: Resize
- Microsoft::MSR::CNTK::GPUMatrix<float>:: SetValue
- Microsoft::MSR::CNTK::Matrix<float>:: SetValue
- Microsoft::MSR::CNTK::TracingGPUMemoryAllocator:: operator=
- Microsoft::MSR::CNTK::Matrix<float>:: __autoclassinit2
- Microsoft::MSR::CNTK::TracingGPUMemoryAllocator:: operator= (x4)
- Microsoft::MSR::CNTK::IDataReader:: operator= (x2)
- Concurrency::details::_ContextCallback:: _CallInContext
- RtlReleaseSRWLockExclusive (x2)
- BaseThreadInitThunk
- RtlUserThreadStart
Hello! Same problem in C# when I'm reading sequences from a CBF file with the GetNextMinibatch method. CNTK ver. 2.4.
System.ApplicationException: Microsoft::MSR::CNTK::GPUMatrix<ElemType>::Resize: Cannot resize the matrix because it is a view.
[CALL STACK]
> Microsoft::MSR::CNTK::CudaTimer:: Stop
- Microsoft::MSR::CNTK::GPUMatrix<float>:: Resize
- Microsoft::MSR::CNTK::GPUMatrix<float>:: SetValue
- Microsoft::MSR::CNTK::Matrix<float>:: SetValue
- Microsoft::MSR::CNTK::DataTransferer:: operator=
- std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase> (x2)
- Microsoft::MSR::CNTK::DataTransferer:: operator=
- std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>
- Microsoft::MSR::CNTK::IDataReader:: operator= (x2)
- Concurrency::details::_ContextCallback:: _CallInContext
- RtlSetThreadWorkOnBehalfTicket (x2)
- BaseThreadInitThunk
- RtlUserThreadStart
I have investigated this problem. The exception arises on GetNextMinibatch after reading data from the input/output dictionaries (with the deprecated methods too). I applied the Erase function to the input/output data after reading, and this fixed the problem. But I still think that it is a bug. In my opinion, the problem is somewhere in data allocation.
public void Test(string testDataPath, string modelPath, UInt32 minibatchSize)
{
    var reader = CreateMiniBatchSource(testDataPath, isTraining: false);
    Function model = Function.Load(modelPath, _device);
    Variable input = model.Arguments[0];
    Variable output = model.Outputs[1];
    StreamInformation inputInfo = reader.StreamInfo("features");
    StreamInformation outputInfo = reader.StreamInfo("labels");

    for (int i = 0; i < 500; ++i)
    {
        var data = reader.GetNextMinibatch(minibatchSize, _device);
        if (data == null || data.empty())
            break;

        var inputData = new Dictionary<Variable, Value>
        {
            { input, data[inputInfo].data },
        };
        var outputData = new Dictionary<Variable, Value>
        {
            { output, null }
        };
        model.Evaluate(inputData, outputData, _device);

        var predicted = outputData[output].GetDenseData<float>(output);
        var expected = data[outputInfo].data.GetDenseData<float>(output);

        // Without this, 'System.ApplicationException: Microsoft::MSR::CNTK::GPUMatrix<ElemType>::Resize' will arise.
        outputData[output].Erase();
        data[outputInfo].data.Erase();

        var joinedResults = predicted
            .Zip(
                expected,
                (f, s) => String.Join(";", "(" + String.Join(" ", f) + ")", "(" + String.Join(" ", s) + ")")
            );
        Console.WriteLine($"Iter {i} results:");
        Console.WriteLine(String.Join(Environment.NewLine, joinedResults));
    }
}
We are also intermittently seeing this on CNTK GPU 2.3 when calling Function.Evaluate via the C# API:
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.ApplicationException: Microsoft::MSR::CNTK::GPUMatrix<ElemType>::Resize: Cannot resize the matrix because it is a view.
[CALL STACK]
> Microsoft::MSR::CNTK::GPUMatrix<float>:: Resize
- Microsoft::MSR::CNTK::Matrix<float>:: Resize
- Microsoft::MSR::CNTK::TracingGPUMemoryAllocator:: operator= (x4)
- CNTK::Internal:: UseSparseGradientAggregationInDataParallelSGD
- Microsoft::MSR::CNTK::TracingGPUMemoryAllocator:: operator=
- CNTK::Internal:: UseSparseGradientAggregationInDataParallelSGD
- CNTK::Function:: Forward
- CNTK::Function:: Evaluate
- CSharp_CNTK_Function__Evaluate__SWIG_0
- 00007FF99DF67E77 (SymFromAddr() error: The specified module could not be found.)
at CNTK.Function._Evaluate(UnorderedMapVariableValuePtr arguments, UnorderedMapVariableValuePtr outputs, DeviceDescriptor computeDevice)
at CNTK.Function.Evaluate(IDictionary`2 inputs, IDictionary`2 outputs, Boolean createPersistentOutputValues, DeviceDescriptor computeDevice)
Will try the workaround suggested by @elevir.
I am getting this error consistently in CNTK 2.5.1 using the managed C# API.
System.ApplicationException
HResult=0x80131600
Message=Resize: Cannot resize the matrix because it is a view.
[CALL STACK]
> Microsoft::MSR::CNTK::CPUMatrix<double>:: _rcrfTransGrdCompute
- Microsoft::MSR::CNTK::CPUMatrix<float>:: Resize
- Microsoft::MSR::CNTK::CPUMatrix<float>:: SetValue
- Microsoft::MSR::CNTK::Matrix<float>:: SetValue
- std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>:: operator=
- std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase> (x2)
- std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>:: operator=
- std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>
- Microsoft::MSR::CNTK::IDataReader:: operator= (x2)
- Concurrency::details::_ContextCallback:: _CallInContext
- RtlAcquireSRWLockExclusive
- RtlReleaseSRWLockExclusive
- BaseThreadInitThunk
- RtlUserThreadStart
Source=Cntk.Core.Managed-2.5.1
StackTrace:
at CNTK.MinibatchSource.GetNextMinibatch(UInt32 minibatchSizeInSamples, DeviceDescriptor device)
at ...
It seems to occur at the end of a data sweep during training, but without changing the batch size (in this example it is 32). It does not happen at the end of the first data sweep, though, but somehow at the end of every 23rd data sweep. This is pretty consistent. I can't share the data, unfortunately.
So this does not seem to have been fully resolved. Note this is running on CPU only, not GPU.
Calling .data.Erase() on the inputs during iterations over training data, like @elevir commented, seems to resolve the issue for me too. And I agree, this is still a bug.
Unfortunately, if using:

public class Evaluator : IDisposable
{
    public double TestMinibatch(UnorderedMapVariableMinibatchData arguments,
        UnorderedMapVariableValuePtr outputsToFetch, DeviceDescriptor computeDevice);
}

for, say, validation testing (i.e. after a training epoch), this same exception occurs, where the outputs to fetch are the actual outputs of the network and the loss. TestMinibatch only reports the "evaluation" value, which is not enough. And now calling .Erase() does not help, which means this seems to be impossible to do now. :|
@zhouwangzw please reopen this issue.
TL;DR: Erase()/Dispose() any Value instances returned, incl. Value instances returned from the .data property on e.g. MinibatchData.
I have isolated the following call that seems to trigger this exception:

var expectedOutputResults = targetsData.data.GetDenseData<float>(expectedOutput);

where targetsData is a minibatch loaded from a CTF file with a 3-element vector called targets and a mask in this file too. If running without this line, it works; with the line in, it fails. Not on the first run, but on the second run (i.e. the second full sweep). E.g. the CTF file has lines like:

|targets 0 -1 0 |mask 1 0 1

The exception occurs even with:

targetsData.data.Erase();

at the end of every loop.
After discovering this, it also appears outputsToFetch doesn't matter; what matters is trying to get data (via GetDenseData) after TestMinibatch is run. This fails every time.
I then inserted a DeepClone call before the GetDenseData call:

var targetsDataClone = targetsData.data.DeepClone(false);
var expectedOutputResults = targetsDataClone.GetDenseData<float>(expectedOutput);

and the exception does not occur. This got me thinking that the problem perhaps is related to .data being a SWIG-generated property that probably returns a new Value instance as a view over existing data inside. And then writing:

var targetsDataValue = targetsData.data;
var expectedOutputResults = targetsDataValue.GetDenseData<float>(expectedOutput);
targetsDataValue.Erase();
targetsDataValue.Dispose();

does not cause an exception either. I assume this is due to the Value instance being erased/disposed.
This then would make me assume that as long as any "resource" has a read-only view Value instance associated with it, it cannot resize. Why a "resize" to a size that is the same as the old size can cause an error due to an existing view, I am not sure. Nevertheless, it seems one must always ensure that Values returned are erased/disposed inside a loop.
This problem/issue exists both for CPU and GPU.
Note that this is not necessarily deterministic; the exception does not always occur. That is probably more a result of my lacking understanding of the different sources of Value, and of the fact that there is a difference between these.
cc: @mdabros
This error also occurred under C# (v2.7.0). Here is my code:

Parallel.For(0, 10000, (i) =>
{
    float[] raw = pRasterLayerCursorTool.PickRagneNormalValue(10, 10, 9, 9);
    int cover = dqn.Predict(state);
});

public float[] Predict(float[] input)
{
    using (Value inputsValue = Value.CreateBatch(inputVariable.Shape, input, device))
    {
        var inputDict = new Dictionary<Variable, Value>() { { inputVariable, inputsValue } };
        var outputDict = new Dictionary<Variable, Value>() { { classifierOutput.Output, null } };
        classifierOutput.Evaluate(inputDict, outputDict, device);
        IList<IList<float>> predictions = outputDict[classifierOutput.Output].GetDenseData<float>(classifierOutput.Output);
        float[] result = predictions[0].ToArray();
        return result;
    }
}
I have solved it by locking on an object inside the Parallel loop:

Parallel.For(0, 10000, (i) =>
{
    // the type of model is 'CNTK.Function'
    lock (model)
    {
        float[] raw = pRasterLayerCursorTool.PickRagneNormalValue(10, 10, 9, 9);
        int cover = model.Predict(state);
    }
});
@axmand You are effectively executing synchronously, while possibly hijacking many threads (and thus hurting performance). Just turning it into a simple synchronous loop would be much better.
I cannot recall whether a CNTK model is safe for parallel use. If it is, you can try to keep your Parallel loop and, inside it, erase the inputDict/outputDict as suggested by @elevir. If it is not, then stick with simple synchronous execution (but it can still be a good idea to clean up the input/output dictionaries on each eval call).
@axmand @jakrivan the following page:
https://docs.microsoft.com/en-us/cognitive-toolkit/cntk-library-evaluation-on-windows
Clearly states:
CNTK supports evaluating multiple requests in parallel. Because running evaluation on the same model instance is not thread-safe, it is required first to create multiple model instances by calling Clone() with ParameterCloningMethod.Share, and then each thread uses a separate model instance for evaluation. The EvaluateMultipleImagesInParallelAsync() demonstrates how to evaluate concurrent requests using CNTK C#/.NET Managed API.
Running in parallel on CPU probably won't help much anyway, since the underlying code is heavily threaded. Hence, as @jakrivan says, you are better off not doing Parallel.For.
The problem we were seeing was not due to parallel for.
I have investigated this problem. This exception arising on GetNextMinibatch after reading data from input/output dictionaries (after deprecated methods too). I applied Erase function to input/output data after reading and this fixed the problem. But I still think that it is bug. In my opinion, problem is somewhere in data allocation.
This is what fixed the issue for me (using the C# API). I have checked the C++ code, and it looks like this exception is thrown when exclusive access to the shared_ptr pointing to the matrix in question could not be ensured. This may explain why Erase() fixes the issue. What remains to be explained is why GetDenseData keeps holding a reference after the data has been fetched.
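The contract being described (storage cannot be resized while a read-only view over it is still alive) can be mimicked in a few lines of plain Python. This is only an illustration of the invariant, not CNTK's actual implementation:

```python
class Storage:
    """Toy stand-in for a matrix buffer that refuses to resize under live views."""
    def __init__(self, size):
        self.size = size
        self.live_views = 0

    def view(self):                # like the Value handed out by MinibatchData.data
        self.live_views += 1
        return self

    def release(self):             # analogous to Erase()/Dispose() on that Value
        self.live_views -= 1

    def resize(self, new_size):    # what reusing the buffer for the next minibatch needs
        if self.live_views > 0:
            raise RuntimeError("Resize: Cannot resize the matrix because it is a view.")
        self.size = new_size

buf = Storage(128)
v = buf.view()
failed_while_viewed = False
try:
    buf.resize(3)                  # a smaller final minibatch wants to reuse the buffer
except RuntimeError:
    failed_while_viewed = True     # fails while the view is alive
v.release()                        # erase/dispose the view inside the loop ...
buf.resize(3)                      # ... and the resize goes through
```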
I get this error as well in the evaluation C# API. To make it work I need to use a "minibatchSizeInSamples" equal to 1 (to evaluate each sequence) or equal to the maximum number of samples (which is not always feasible given the amount of evaluation data that must be kept in memory).
[CALL STACK]
> Microsoft::MSR::CNTK::ConvolutionEngine:: SetmMaxTempMemSizeInSamples
- Microsoft::MSR::CNTK::CPUSparseMatrix:: Resize
- Microsoft::MSR::CNTK::CPUSparseMatrix:: SetMatrixFromCSCFormat
- Microsoft::MSR::CNTK::Matrix:: SetMatrixFromCSCFormat
- CNTK::TrainingParameterSchedule:: GetMinibatchSize (x4)
- Microsoft::MSR::CNTK::IDataReader:: operator= (x2)
- Concurrency::details:: _Schedule_chore
- RtlInitializeCriticalSection
- LdrAccessResource
- BaseThreadInitThunk
- RtlUserThreadStart
@Pescu From our observations we concluded the problems were all related to the use of the built-in data readers in CNTK. After switching away from these to custom-built ones, we have not seen these issues anymore. Not a big help... unfortunately.
When running a test run over a model, I've gotten this error twice:
Traceback (most recent call last):
  File "test.py", line 65, in <module>
    mb = reader.next_minibatch(minibatch_size, input_map=input_map)
  File "/root/anaconda3/envs/cntk-py34/lib/python3.4/site-packages/cntk/utils/swig_helper.py", line 58, in wrapper
    result = f(*args, **kwds)
  File "/root/anaconda3/envs/cntk-py34/lib/python3.4/site-packages/cntk/io/__init__.py", line 161, in next_minibatch
    minibatch_size_in_samples, device)
  File "/root/anaconda3/envs/cntk-py34/lib/python3.4/site-packages/cntk/cntk_py.py", line 1916, in get_next_minibatch
    return _cntk_py.MinibatchSource_get_next_minibatch(self, *args)
RuntimeError: Resize: Cannot resize the matrix because it is a view.
This happened 2 invocations in a row, but running it a 3rd time seems to have "fixed" the issue. Is this known behavior?