hkchengrex / Tracking-Anything-with-DEVA

[ICCV 2023] Tracking Anything with Decoupled Video Segmentation
https://hkchengrex.com/Tracking-Anything-with-DEVA/

RuntimeWarning: Trying to segment without any memory #90

Closed. Masrur02 closed this issue 5 months ago.

Masrur02 commented 5 months ago

Hi, I am getting a runtime warning, "Trying to segment without any memory!", when I run the demo_automatic.py file. After the run completes, the outputs are not segmented; they are simply the input images. I have also tried adjusting the parameters to reduce memory usage. Another problem: when I run the demo_with_text.py file, I get this log:

UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1659484809662/work/aten/src/ATen/native/TensorShape.cpp:2894.)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
final text_encoder_type: bert-base-uncased
Configuration: {'model': './saves/DEVA-propagation.pth', 'output': './example/output', 'save_all': False, 'amp': True, 'key_dim': 64, 'value_dim': 512, 'pix_feat_dim': 512, 'disable_long_term': False, 'max_mid_term_frames': 10, 'min_mid_term_frames': 5, 'max_long_term_elements': 10000, 'num_prototypes': 128, 'top_k': 30, 'mem_every': 5, 'chunk_size': 4, 'size': 480, 'GROUNDING_DINO_CONFIG_PATH': './saves/GroundingDINO_SwinT_OGC.py', 'GROUNDING_DINO_CHECKPOINT_PATH': './saves/groundingdino_swint_ogc.pth', 'DINO_THRESHOLD': 0.35, 'DINO_NMS_THRESHOLD': 0.8, 'SAM_ENCODER_VERSION': 'vit_h', 'SAM_CHECKPOINT_PATH': './saves/sam_vit_h_4b8939.pth', 'MOBILE_SAM_CHECKPOINT_PATH': './saves/mobile_sam.pt', 'SAM_NUM_POINTS_PER_SIDE': 64, 'SAM_NUM_POINTS_PER_BATCH': 64, 'SAM_PRED_IOU_THRESHOLD': 0.88, 'SAM_OVERLAP_THRESHOLD': 0.8, 'img_path': './example/vipseg/images', 'detection_every': 5, 'num_voting_frames': 3, 'temporal_setting': 'semionline', 'max_missed_detection_count': 10, 'max_num_objects': -1, 'prompt': 'person.hat.horse', 'sam_variant': 'original', 'enable_long_term': True, 'enable_long_term_count_usage': False}

  0%|          | 0/2 [00:00<?, ?it/s]UserWarning: None of the inputs have requires_grad=True. Gradients will be None

  0%|          | 0/2 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/soicroot/Downloads/Khan/Lane/DEVA/demo/demo_with_text.py", line 64, in <module>
    process_frame(deva, gd_model, sam_model, im_path, result_saver, ti, image_np=frame)
  File "/home/soicroot/miniconda3/envs/DEVA/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/soicroot/Downloads/Khan/Lane/DEVA/deva/ext/with_text_processor.py", line 58, in process_frame_with_text
    mask, segments_info = make_segmentation_with_text(cfg, image_np, gd_model, sam_model,
  File "/home/soicroot/Downloads/Khan/Lane/DEVA/deva/ext/with_text_processor.py", line 25, in make_segmentation_with_text
    mask, segments_info = segment_with_text(cfg, gd_model, sam_model, image_np, prompts, min_side)
  File "/home/soicroot/Downloads/Khan/Lane/DEVA/deva/ext/grounding_dino.py", line 73, in segment_with_text
    detections = gd_model.predict_with_classes(image=cv2.cvtColor(image, cv2.COLOR_RGB2BGR),
  File "/home/soicroot/Downloads/Khan/Lane/DEVA/Grounded-Segment-Anything/GroundingDINO/groundingdino/util/inference.py", line 195, in predict_with_classes
    boxes, logits, phrases = predict(
  File "/home/soicroot/Downloads/Khan/Lane/DEVA/Grounded-Segment-Anything/GroundingDINO/groundingdino/util/inference.py", line 67, in predict
    outputs = model(image[None], captions=[caption])
  File "/home/soicroot/miniconda3/envs/DEVA/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/soicroot/Downloads/Khan/Lane/DEVA/Grounded-Segment-Anything/GroundingDINO/groundingdino/models/GroundingDINO/groundingdino.py", line 313, in forward
    hs, reference, hs_enc, ref_enc, init_box_proposal = self.transformer(
  File "/home/soicroot/miniconda3/envs/DEVA/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/soicroot/Downloads/Khan/Lane/DEVA/Grounded-Segment-Anything/GroundingDINO/groundingdino/models/GroundingDINO/transformer.py", line 248, in forward
    (spatial_shapes.new_zeros((1,)), spatial_shapes.prod(1).cumsum(0)[:-1])
RuntimeError: 
  #define POS_INFINITY __int_as_float(0x7f800000)
  #define INFINITY POS_INFINITY
  #define NEG_INFINITY __int_as_float(0xff800000)
  #define NAN __int_as_float(0x7fffffff)

  typedef long long int int64_t;
  typedef unsigned int uint32_t;
  typedef signed char int8_t;
  typedef unsigned char uint8_t;  // NOTE: this MUST be "unsigned char"! "char" is equivalent to "signed char"
  typedef short int16_t;
  static_assert(sizeof(int64_t) == 8, "expected size does not match");
  static_assert(sizeof(uint32_t) == 4, "expected size does not match");
  static_assert(sizeof(int8_t) == 1, "expected size does not match");
  constexpr int num_threads = 128;
  constexpr int thread_work_size = 4; // TODO: make template substitution once we decide where those vars live
  constexpr int block_work_size = thread_work_size * num_threads;
  //TODO use _assert_fail, because assert is disabled in non-debug builds
  #define ERROR_UNSUPPORTED_CAST assert(false);

  namespace std {

  using ::signbit;
  using ::isfinite;
  using ::isinf;
  using ::isnan;

  using ::abs;

  using ::acos;
  using ::acosf;
  using ::asin;
  using ::asinf;
  using ::atan;
  using ::atanf;
  using ::atan2;
  using ::atan2f;
  using ::ceil;
  using ::ceilf;
  using ::cos;
  using ::cosf;
  using ::cosh;
  using ::coshf;

  using ::exp;
  using ::expf;

  using ::fabs;
  using ::fabsf;
  using ::floor;
  using ::floorf;

  using ::fmod;
  using ::fmodf;

  using ::frexp;
  using ::frexpf;
  using ::ldexp;
  using ::ldexpf;

  using ::log;
  using ::logf;

  using ::log10;
  using ::log10f;
  using ::modf;
  using ::modff;

  using ::pow;
  using ::powf;

  using ::sin;
  using ::sinf;
  using ::sinh;
  using ::sinhf;

  using ::sqrt;
  using ::sqrtf;
  using ::tan;
  using ::tanf;

  using ::tanh;
  using ::tanhf;

  using ::acosh;
  using ::acoshf;
  using ::asinh;
  using ::asinhf;
  using ::atanh;
  using ::atanhf;
  using ::cbrt;
  using ::cbrtf;

  using ::copysign;
  using ::copysignf;

  using ::erf;
  using ::erff;
  using ::erfc;
  using ::erfcf;
  using ::exp2;
  using ::exp2f;
  using ::expm1;
  using ::expm1f;
  using ::fdim;
  using ::fdimf;
  using ::fmaf;
  using ::fma;
  using ::fmax;
  using ::fmaxf;
  using ::fmin;
  using ::fminf;
  using ::hypot;
  using ::hypotf;
  using ::ilogb;
  using ::ilogbf;
  using ::lgamma;
  using ::lgammaf;
  using ::llrint;
  using ::llrintf;
  using ::llround;
  using ::llroundf;
  using ::log1p;
  using ::log1pf;
  using ::log2;
  using ::log2f;
  using ::logb;
  using ::logbf;
  using ::lrint;
  using ::lrintf;
  using ::lround;
  using ::lroundf;

  using ::nan;
  using ::nanf;

  using ::nearbyint;
  using ::nearbyintf;
  using ::nextafter;
  using ::nextafterf;
  using ::remainder;
  using ::remainderf;
  using ::remquo;
  using ::remquof;
  using ::rint;
  using ::rintf;
  using ::round;
  using ::roundf;
  using ::scalbln;
  using ::scalblnf;
  using ::scalbn;
  using ::scalbnf;
  using ::tgamma;
  using ::tgammaf;
  using ::trunc;
  using ::truncf;

  } // namespace std

  // NB: Order matters for this macro; it is relied upon in
  // _promoteTypesLookup and the serialization format.
  // Note, some types have ctype as void because we don't support them in codegen
  #define AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(_) \
  _(uint8_t, Byte) /* 0 */                               \
  _(int8_t, Char) /* 1 */                                \
  _(int16_t, Short) /* 2 */                              \
  _(int, Int) /* 3 */                                    \
  _(int64_t, Long) /* 4 */                               \
  _(at::Half, Half) /* 5 */                                  \
  _(float, Float) /* 6 */                                \
  _(double, Double) /* 7 */                              \
  _(std::complex<at::Half>, ComplexHalf) /* 8 */        \
  _(std::complex<float>, ComplexFloat) /* 9 */                          \
  _(std::complex<double>, ComplexDouble) /* 10 */                         \
  _(bool, Bool) /* 11 */                                 \
  _(void, QInt8) /* 12 */                          \
  _(void, QUInt8) /* 13 */                        \
  _(void, QInt32) /* 14 */                        \
  _(at::BFloat16, BFloat16) /* 15 */                             \

  #define AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_QINT(_)       \
  _(uint8_t, Byte)                                                 \
  _(int8_t, Char)                                                  \
  _(int16_t, Short)                                                \
  _(int, Int)                                                      \
  _(int64_t, Long)                                                 \
  _(at::Half, Half)                                                \
  _(float, Float)                                                  \
  _(double, Double)                                                \
  _(std::complex<at::Half>, ComplexHalf)                           \
  _(std::complex<float>, ComplexFloat)                             \
  _(std::complex<double>, ComplexDouble)                           \
  _(bool, Bool)                                                    \
  _(at::BFloat16, BFloat16)

  enum class ScalarType : int8_t {
  #define DEFINE_ENUM(_1, n) n,
  AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(DEFINE_ENUM)
  #undef DEFINE_ENUM
      Undefined,
  NumOptions
  };

  template <typename T, int size>
  struct Array {
  T data[size];

  __device__ T operator[](int i) const {
      return data[i];
  }
  __device__ T& operator[](int i) {
      return data[i];
  }
  Array() = default;
  Array(const Array&) = default;
  Array& operator=(const Array&) = default;
  __device__ Array(T x) {
    for (int i = 0; i < size; i++) {
      data[i] = x;
    }
  }
  };

  template <typename T>
  struct DivMod {
  T div;
  T mod;

  __device__ DivMod(T _div, T _mod) {
      div = _div;
      mod = _mod;
  }
  };

  //<unsigned int>
  struct IntDivider {
  IntDivider() = default;

  __device__ inline unsigned int div(unsigned int n) const {
  unsigned int t = __umulhi(n, m1);
  return (t + n) >> shift;
  }

  __device__ inline unsigned int mod(unsigned int n) const {
  return n - div(n) * divisor;
  }

  __device__ inline DivMod<unsigned int> divmod(unsigned int n) const {
  unsigned int q = div(n);
  return DivMod<unsigned int>(q, n - q * divisor);
  }

  unsigned int divisor;  // d above.
  unsigned int m1;  // Magic number: m' above.
  unsigned int shift;  // Shift amounts.
  };

  template <int NARGS>
  struct TrivialOffsetCalculator {
    // The offset for each argument. Wrapper around fixed-size array.
    // The offsets are in # of elements, not in bytes.
    Array<unsigned int, NARGS> get(unsigned int linear_idx) const {
      Array<unsigned int, NARGS> offsets;
      #pragma unroll
      for (int arg = 0; arg < NARGS; arg++) {
        offsets[arg] = linear_idx;
      }
      return offsets;
    }
  };

  template<int NARGS>
  struct OffsetCalculator {
  OffsetCalculator() = default;
  __device__ __forceinline__ Array<unsigned int, NARGS> get(unsigned int linear_idx) const {
      Array<unsigned int, NARGS> offsets;
      #pragma unroll
      for (int arg = 0; arg < NARGS; ++arg) {
      offsets[arg] = 0;
      }

      #pragma unroll
      for (int dim = 0; dim < 25; ++dim) {
      if (dim == dims) {
          break;
      }

      auto divmod = sizes_[dim].divmod(linear_idx);
      linear_idx = divmod.div;

      #pragma unroll
      for (int arg = 0; arg < NARGS; ++arg) {
          offsets[arg] += divmod.mod * strides_[dim][arg];
      }
      //printf("offset calc thread dim size stride offset %d %d %d %d %d %d %d %d\n",
      //threadIdx.x, dim, sizes_[dim].divisor, strides_[dim][0], offsets[0], linear_idx, divmod.div, divmod.mod);
      }
      return offsets;
  }

    int dims;
    IntDivider sizes_[25];
    // NOTE: this approach will not support nInputs == 0
    unsigned int strides_[25][NARGS];
  };

  #define C10_HOST_DEVICE __host__ __device__
  #define C10_DEVICE __device__

  template <typename T>
  __device__ __forceinline__ T WARP_SHFL_DOWN(T value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff)
  {
    return __shfl_down_sync(mask, value, delta, width);
  }

  #if 0
  template <typename T>
  __device__ __forceinline__ std::complex<T> WARP_SHFL_DOWN(std::complex<T> value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff)
  {
    return std::complex<T>(
        __shfl_down_sync(mask, value.real(), delta, width),
        __shfl_down_sync(mask, value.imag(), delta, width));
  }
  #endif

  // aligned vector generates vectorized load/store on CUDA
  template<typename scalar_t, int vec_size>
  struct alignas(sizeof(scalar_t) * vec_size) aligned_vector {
    scalar_t val[vec_size];
  };

  C10_HOST_DEVICE static void reduce_fraction(size_t &numerator, size_t &denominator) {
    // get GCD of num and denom using Euclid's algorithm.
    // Can replace this with std::gcd if we ever support c++17.
    size_t a = denominator;
    size_t b = numerator;
    while (b != 0) {
        a %= b;
        // swap(a,b)
        size_t tmp = a;
        a = b;
        b = tmp;
    }

    // a is now the GCD
    numerator /= a;
    denominator /= a;
  }

  struct ReduceConfig {
  //has to match host-side ReduceConfig in the eager code
  static constexpr int BLOCK_X = 0;
  static constexpr int BLOCK_Y = 1;
  static constexpr int CTA = 2;

  static constexpr int input_vec_size = 4;
  int element_size_bytes;
  int num_inputs;
  int num_outputs;
  int step_input = 1;
  int step_output = 1;
  int ctas_per_output = 1;
  int input_mult[3] = {0, 0, 0};
  int output_mult[2] = {0, 0};

  int block_width;
  int block_height;
  int num_threads;

  bool vectorize_input = false;
  int output_vec_size = 1;

  C10_HOST_DEVICE bool should_block_x_reduce() const {
    return input_mult[BLOCK_X] != 0;
  }

  C10_HOST_DEVICE bool should_block_y_reduce() const {
    return input_mult[BLOCK_Y] != 0;
  }

  C10_HOST_DEVICE bool should_global_reduce() const {
    return input_mult[CTA] != 0;
  }

  C10_DEVICE bool should_store(int output_idx) const {
    return output_idx < num_outputs &&
      (!should_block_x_reduce() || threadIdx.x == 0) &&
      (!should_block_y_reduce() || threadIdx.y == 0);
  }

  C10_DEVICE bool should_reduce_tail() const {
    return (!should_block_y_reduce() || threadIdx.y == 0) &&
      (!should_global_reduce() || blockIdx.y == 0);
  }

  C10_HOST_DEVICE int input_idx() const {
    int lane = threadIdx.x;
    int warp = threadIdx.y;
    int cta2 = blockIdx.y;
    return (lane * input_mult[BLOCK_X] +
            warp * input_mult[BLOCK_Y] +
            cta2 * input_mult[CTA]);
  }

  template <int output_vec_size>
  C10_HOST_DEVICE int output_idx() const {
    int lane = threadIdx.x;
    int warp = threadIdx.y;
    int cta1 = blockIdx.x;
    return (lane * output_mult[BLOCK_X] +
            warp * output_mult[BLOCK_Y] +
            cta1 * step_output) * output_vec_size;
  }

  C10_DEVICE int shared_memory_offset(int offset) const {
    return threadIdx.x + (threadIdx.y + offset) * blockDim.x;
  }

  C10_DEVICE int staging_memory_offset(int cta2) const {
    int offset = cta2 + blockIdx.x * gridDim.y;
    if (!should_block_x_reduce()) {
      offset = threadIdx.x + offset * blockDim.x;
    }
    return offset;
  }

  };

//TODO this will need to be different for more generic reduction functions
namespace reducer {

  using scalar_t = int64_t;
  using arg_t = int64_t;
  using out_scalar_t = int64_t;

  inline __device__ arg_t combine(arg_t a, arg_t b) { return a * b; }

  inline __device__ out_scalar_t project(arg_t arg) {
    return (out_scalar_t) arg;
  }

  inline __device__ arg_t warp_shfl_down(arg_t arg, int offset) {
    return WARP_SHFL_DOWN(arg, offset);
  }

  inline __device__ arg_t translate_idx(arg_t acc, int64_t /*idx*/) {
    return acc;
  }

  // wrap a normal reduction that ignores the index
  inline __device__ arg_t reduce(arg_t acc, arg_t val, int64_t idx) {
     return combine(acc, val);
  }
}

struct ReduceJitOp {
  using scalar_t = int64_t;
  using arg_t = int64_t;
  using out_scalar_t = int64_t;

  using InputCalculator = OffsetCalculator<1>;
  using OutputCalculator = OffsetCalculator<2>;

//   static constexpr bool can_accumulate_in_output =
//     std::is_convertible<arg_t, out_scalar_t>::value
//     && std::is_convertible<out_scalar_t, arg_t>::value;

  static constexpr int input_vec_size = ReduceConfig::input_vec_size;

  arg_t ident;
  ReduceConfig config;
  InputCalculator input_calc;
  OutputCalculator output_calc;
  const void* src;
  const char* dst[2]; //it accepts at most two destinations
  // acc_buf used for accumulation among sub Tensor Iterator when accumulation on
  // output is not permissible
  void* acc_buf;
  // cta_buf used for accumulation between blocks during global reduction
  void* cta_buf;
  int* semaphores;
  int64_t base_idx;
  bool accumulate;
  bool final_output;
  int noutputs;

  C10_DEVICE void run() const {
    extern __shared__ char shared_memory[];
    uint32_t output_idx = config.output_idx<1>();
    uint32_t input_idx = config.input_idx();
    auto base_offsets1 = output_calc.get(output_idx)[1];

    using arg_vec_t = Array<arg_t, 1>;
    arg_vec_t value;

    if (output_idx < config.num_outputs && input_idx < config.num_inputs) {
      const scalar_t* input_slice = (const scalar_t*)((const char*)src + base_offsets1);

      value = thread_reduce<1>(input_slice);
    }

    if (config.should_block_y_reduce()) {
      value = block_y_reduce<1>(value, shared_memory);
    }
    if (config.should_block_x_reduce()) {
      value = block_x_reduce<1>(value, shared_memory);
    }

    using out_ptr_vec_t = Array<out_scalar_t*, 1>;
    using offset_vec_t = Array<uint32_t, 1>;
    offset_vec_t base_offsets;
    out_ptr_vec_t out;

    #pragma unroll
    for (int i = 0; i < 1; i++) {
      base_offsets[i] = output_calc.get(output_idx + i)[0];
      out[i] = (out_scalar_t*)((char*)dst[0] + base_offsets[i]);
    }

    arg_vec_t* acc = nullptr;
    if (acc_buf != nullptr) {
      size_t numerator = sizeof(arg_t);
      size_t denominator = sizeof(out_scalar_t);
      reduce_fraction(numerator, denominator);
      acc = (arg_vec_t*)((char*)acc_buf + (base_offsets[0] * numerator / denominator));
    }

    if (config.should_global_reduce()) {
      value = global_reduce<1>(value, acc, shared_memory);
    } else if (config.should_store(output_idx)) {
      if (accumulate) {
        #pragma unroll
        for (int i = 0; i < 1; i++) {
          value[i] = reducer::translate_idx(value[i], base_idx);
        }
      }

      if (acc == nullptr) {
        if (accumulate) {
          value = accumulate_in_output<1>(out, value);
        }
        if (final_output) {
          set_results_to_output<1>(value, base_offsets);
        } else {
          #pragma unroll
          for (int i = 0; i < 1; i++) {
            *(out[i]) = get_accumulated_output(out[i], value[i]);
          }
        }
      } else {
        if (accumulate) {
          #pragma unroll
          for (int i = 0; i < 1; i++) {
            value[i] = reducer::combine((*acc)[i], value[i]);
          }
        }
        if (final_output) {
          set_results_to_output<1>(value, base_offsets);
        } else {
          *acc = value;
        }
      }
    }
  }

  template <int output_vec_size>
  C10_DEVICE Array<arg_t, output_vec_size> thread_reduce(const scalar_t* data) const {
    if (config.vectorize_input) {
      assert(output_vec_size == 1);
      // reduce at the header of input_slice where memory is not aligned,
      // so that thread_reduce will have an aligned memory to work on.
      return {input_vectorized_thread_reduce_impl(data)};
    } else {
      uint32_t element_stride = input_calc.strides_[0][0] / sizeof(scalar_t);
      bool is_contiguous = (input_calc.dims == 1 && element_stride == 1);
      if (is_contiguous) {
        return thread_reduce_impl<output_vec_size>(data, [](uint32_t idx) { return idx; });
      } else if (input_calc.dims == 1) {
        return thread_reduce_impl<output_vec_size>(data, [&](uint32_t idx) { return idx * element_stride; });
      } else {
        return thread_reduce_impl<output_vec_size>(data, [&](uint32_t idx) { return input_calc.get(idx)[0] / sizeof(scalar_t); });
      }
    }
  }

  C10_DEVICE arg_t input_vectorized_thread_reduce_impl(const scalar_t* data) const {
    uint32_t end = config.num_inputs;

    // Handle the head of input slice where data is not aligned
    arg_t value = ident;
    constexpr int align_bytes = alignof(aligned_vector<scalar_t, input_vec_size>);
    constexpr int align_elements = align_bytes / sizeof(scalar_t);
    int shift = ((int64_t)data) % align_bytes / sizeof(scalar_t);
    if (shift > 0) {
      data -= shift;
      end += shift;
      if(threadIdx.x >= shift && threadIdx.x < align_elements && config.should_reduce_tail()){
        value = reducer::reduce(value, data[threadIdx.x], threadIdx.x - shift);
      }
      end -= align_elements;
      data += align_elements;
      shift = align_elements - shift;
    }

    // Do the vectorized reduction
    using load_t = aligned_vector<scalar_t, input_vec_size>;

    uint32_t idx = config.input_idx();
    const uint32_t stride = config.step_input;

    // Multiple accumulators to remove dependency between unrolled loops.
    arg_t value_list[input_vec_size];
    value_list[0] = value;

    #pragma unroll
    for (int i = 1; i < input_vec_size; i++) {
      value_list[i] = ident;
    }

    scalar_t values[input_vec_size];

    load_t *values_vector = reinterpret_cast<load_t*>(&values[0]);

    while (idx * input_vec_size + input_vec_size - 1 < end) {
      *values_vector = reinterpret_cast<const load_t*>(data)[idx];
      #pragma unroll
      for (uint32_t i = 0; i < input_vec_size; i++) {
        value_list[i] = reducer::reduce(value_list[i], values[i], shift + idx * input_vec_size + i);
      }
      idx += stride;
    }

    // tail
    uint32_t tail_start = end - end % input_vec_size;
    if (config.should_reduce_tail()) {
      int idx = tail_start + threadIdx.x;
      if (idx < end) {
        value_list[0] = reducer::reduce(value_list[0], data[idx], idx + shift);
      }
    }

    // combine accumulators
    #pragma unroll
    for (int i = 1; i < input_vec_size; i++) {
      value_list[0] = reducer::combine(value_list[0], value_list[i]);
    }
    return value_list[0];
  }

  template <int output_vec_size, typename offset_calc_t>
  C10_DEVICE Array<arg_t, output_vec_size> thread_reduce_impl(const scalar_t* data_, offset_calc_t calc) const {
    uint32_t idx = config.input_idx();
    const uint32_t end = config.num_inputs;
    const uint32_t stride = config.step_input;
    const int vt0=4;

    using arg_vec_t = Array<arg_t, output_vec_size>;
    using load_t = aligned_vector<scalar_t, output_vec_size>;
    const load_t* data = reinterpret_cast<const load_t*>(data_);

    // Multiple accumulators to remove dependency between unrolled loops.
    arg_vec_t value_list[vt0];

    #pragma unroll
    for (int i = 0; i < vt0; i++) {
      #pragma unroll
      for (int j = 0; j < output_vec_size; j++) {
        value_list[i][j] = ident;
      }
    }

    load_t values[vt0];

    while (idx + (vt0 - 1) * stride < end) {
      #pragma unroll
      for (uint32_t i = 0; i < vt0; i++) {
        values[i] = data[calc(idx + i * stride) / output_vec_size];
      }
      #pragma unroll
      for (uint32_t i = 0; i < vt0; i++) {
        #pragma unroll
        for (uint32_t j = 0; j < output_vec_size; j++) {
          value_list[i][j] = reducer::reduce(value_list[i][j], values[i].val[j], idx + i * stride);
        }
      }
      idx += stride * vt0;
    }

    // tail
    int idx_ = idx;
    #pragma unroll
    for (uint32_t i = 0; i < vt0; i++) {
      if (idx >= end) {
        break;
      }
      values[i] = data[calc(idx) / output_vec_size];
      idx += stride;
    }
    idx = idx_;
    #pragma unroll
    for (uint32_t i = 0; i < vt0; i++) {
      if (idx >= end) {
        break;
      }
      #pragma unroll
      for (uint32_t j = 0; j < output_vec_size; j++) {
        value_list[i][j] = reducer::reduce(value_list[i][j], values[i].val[j], idx);
      }
      idx += stride;
    }

    // combine accumulators
    #pragma unroll
    for (int i = 1; i < vt0; i++) {
      #pragma unroll
      for (uint32_t j = 0; j < output_vec_size; j++) {
        value_list[0][j] = reducer::combine(value_list[0][j], value_list[i][j]);
      }
    }
    return value_list[0];
  }
  template <int output_vec_size>
  C10_DEVICE Array<arg_t, output_vec_size> block_x_reduce(Array<arg_t, output_vec_size> value, char* shared_memory) const {
    using args_vec_t = Array<arg_t, output_vec_size>;
    int dim_x = blockDim.x;
    args_vec_t* shared = (args_vec_t*)shared_memory;
    if (dim_x > warpSize) {
      int address_base = threadIdx.x + threadIdx.y*blockDim.x;
      shared[address_base] = value;
      for (int offset = dim_x/2; offset >= warpSize; offset >>= 1) {
        __syncthreads();
        if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x) {
          args_vec_t other = shared[address_base + offset];
          #pragma unroll
          for (int i = 0; i < output_vec_size; i++) {
            value[i] = reducer::combine(value[i], other[i]);
          }
          shared[address_base] = value;
        }
      }
      dim_x = warpSize;
    }

    __syncthreads();

    for (int offset = 1; offset < dim_x; offset <<= 1) {
      #pragma unroll
      for (int i = 0; i < output_vec_size; i++) {
        arg_t other = reducer::warp_shfl_down(value[i], offset);
        value[i] = reducer::combine(value[i], other);
      }
    }
    return value;
  }

  template <int output_vec_size>
  C10_DEVICE Array<arg_t, output_vec_size> block_y_reduce(Array<arg_t, output_vec_size> value, char* shared_memory) const {
    using args_vec_t = Array<arg_t, output_vec_size>;
    args_vec_t* shared = (args_vec_t*)shared_memory;
    shared[config.shared_memory_offset(0)] = value;
    for (int offset = blockDim.y / 2; offset > 0; offset >>= 1) {
      __syncthreads();
      if (threadIdx.y < offset && threadIdx.y + offset < blockDim.y) {
        args_vec_t other = shared[config.shared_memory_offset(offset)];
        #pragma unroll
        for (int i = 0; i < output_vec_size; i++) {
          value[i] = reducer::combine(value[i], other[i]);
        }
        shared[config.shared_memory_offset(0)] = value;
      }
    }
    return value;
  }

  C10_DEVICE bool mark_block_finished() const {
    __shared__ bool is_last_block_done_shared;

    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0) {
      int prev_blocks_finished = atomicAdd(&semaphores[blockIdx.x], 1);
      is_last_block_done_shared = (prev_blocks_finished == gridDim.y - 1);
    }

    __syncthreads();

    return is_last_block_done_shared;
  }

  template <int output_vec_size>
  C10_DEVICE Array<arg_t, output_vec_size> accumulate_in_output(
    Array<out_scalar_t*, output_vec_size> out,
    Array<arg_t, output_vec_size> value
  ) const {
    Array<arg_t, output_vec_size> ret;
    #pragma unroll
    for (int i = 0; i < output_vec_size; i++) {
      ret[i] = reducer::combine(*(out[i]), value[i]);
    }
    return ret;
  }

  C10_DEVICE out_scalar_t get_accumulated_output(
    out_scalar_t* out, arg_t value
  ) const {
    assert(!final_output);
    return (out_scalar_t)value;
  }

  template<class T>
  C10_DEVICE void set_results(const T x, const uint32_t base_offset) const {
    assert(noutputs == 1);
    auto res = (out_scalar_t*)((char*)dst[0] + base_offset);
    *res = x;
  }

//TODO - multi-output reduction - we won't be able to use thrust::pair
//just explicitly specify typed output reads/writes
//Currently implemented for max of two outputs
//   template<class T1, class T2>
//   C10_DEVICE void set_results(const thrust::pair<T1, T2> x, const index_t base_offset) const {
//     if (noutputs >= 1) {
//       auto res0 = (T1*)((char*)dst[0] + base_offset);
//       *res0 = x.first;
//     }
//     if (noutputs >= 2) {
//       // base offset is computed assuming element size being sizeof(T1), so we need to make a
//       // correction to obtain the correct base offset
//       auto res1 = (T2*) ((char *) dst[1] + base_offset / sizeof(T1) * sizeof(T2));
//       *res1 = x.second;
//     }
//   }

  template <int output_vec_size>
  C10_DEVICE void set_results_to_output(Array<arg_t, output_vec_size> value, Array<uint32_t, output_vec_size> base_offset) const {
    assert(final_output);
    #pragma unroll
    for (int i = 0; i < output_vec_size; i++) {
      set_results(reducer::project(value[i]), base_offset[i]);
    }
  }

  template <int output_vec_size>
  C10_DEVICE Array<arg_t, output_vec_size> global_reduce(Array<arg_t, output_vec_size> value, Array<arg_t, output_vec_size> *acc, char* shared_memory) const {
    using arg_vec_t = Array<arg_t, output_vec_size>;
    using out_ptr_vec_t = Array<out_scalar_t*, output_vec_size>;
    using offset_vec_t = Array<uint32_t, output_vec_size>;

    arg_vec_t* reduce_buffer = (arg_vec_t*)cta_buf;
    uint32_t output_idx = config.output_idx<output_vec_size>();
    offset_vec_t base_offsets;
    out_ptr_vec_t out;

    #pragma unroll
    for (int i = 0; i < output_vec_size; i++) {
      base_offsets[i] = output_calc.get(output_idx + i)[0];
      out[i] = (out_scalar_t*)((char*)dst[0] + base_offsets[i]);
    }

    bool should_store = config.should_store(output_idx);
    if (should_store) {
      uint32_t offset = config.staging_memory_offset(blockIdx.y);
      reduce_buffer[offset] = value;
    }

    __threadfence(); // make sure writes are globally visible
    __syncthreads(); // if multiple warps in this block wrote to staging, make sure they're all done
    bool is_last_block_done = mark_block_finished();

    if (is_last_block_done) {
      value = ident;
      if (config.should_block_x_reduce()) {
        uint32_t input_offset = threadIdx.x + threadIdx.y * blockDim.x;
        uint32_t step = blockDim.x * blockDim.y;
        for (; input_offset < config.ctas_per_output; input_offset += step) {
          uint32_t idx = config.staging_memory_offset(input_offset);
          arg_vec_t next = reduce_buffer[idx];
          #pragma unroll
          for (int i = 0; i < output_vec_size; i++) {
            value[i] = reducer::combine(value[i], next[i]);
          }
        }
      } else {
        uint32_t input_offset = threadIdx.y;
        uint32_t step = blockDim.y;
        for (; input_offset < config.ctas_per_output; input_offset += step) {
          uint32_t idx = config.staging_memory_offset(input_offset);
          arg_vec_t next = reduce_buffer[idx];
          #pragma unroll
          for (int i = 0; i < output_vec_size; i++) {
            value[i] = reducer::combine(value[i], next[i]);
          }
        }
      }
      value = block_y_reduce(value, shared_memory);
      if (config.should_block_x_reduce()) {
        value = block_x_reduce<output_vec_size>(value, shared_memory);
      }
      if (should_store) {
        if (accumulate) {
          #pragma unroll
          for (int i = 0; i < output_vec_size; i++) {
            value[i] = reducer::translate_idx(value[i], base_idx);
          }
        }

        if (acc == nullptr) {
          if (accumulate) {
            value = accumulate_in_output<output_vec_size>(out, value);
          }
          if (final_output) {
            set_results_to_output<output_vec_size>(value, base_offsets);
          } else {
            #pragma unroll
            for (int i = 0; i < output_vec_size; i++) {
              *(out[i]) = get_accumulated_output(out[i], value[i]);
            }
          }
        } else {
          if (accumulate) {
            #pragma unroll
            for (int i = 0; i < output_vec_size; i++) {
              value[i] = reducer::combine((*acc)[i], value[i]);
            }
          }
          if (final_output) {
            set_results_to_output<output_vec_size>(value, base_offsets);
          } else {
            *acc = value;
          }
        }
      }
    }

    return value;
  }
};

extern "C"
__launch_bounds__(512, 4)
__global__ void reduction_prod_kernel(ReduceJitOp r){
  r.run();
}
nvrtc: error: invalid value for --gpu-architecture (-arch)

My GPU is an NVIDIA RTX 4090. How can I get rid of this problem? TIA
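
For reference, the failure can probably be reproduced outside DEVA. The traceback dies at spatial_shapes.prod(1).cumsum(0), which PyTorch lowers to the JIT-compiled reduction_prod_kernel shown above, so a bare int64 prod/cumsum on the GPU should hit the same nvrtc error (a minimal sketch; the tensor values are placeholders):

import torch

# Placeholder feature-map shapes; only the dtype (int64) and the CUDA device matter.
spatial_shapes = torch.tensor([[100, 150], [50, 75], [25, 38]], device='cuda')
print(torch.__version__, torch.version.cuda)
# Expected to raise the same "invalid value for --gpu-architecture (-arch)" error
# if the bundled nvrtc does not know this GPU's architecture.
print(spatial_shapes.prod(1).cumsum(0))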

hkchengrex commented 5 months ago

The warning means that the detector cannot detect anything. Can you try re-installing Grounded-DINO/SAM? It seems like a problem with one of those dependencies.
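
If it helps to narrow this down: the warning is raised when the detector returns zero boxes on the detection frames, so a rough single-image sanity check of the Grounding-DINO install could look like the sketch below (it reuses the config/checkpoint paths and prompt from the configuration printout above; the frame filename is a placeholder):

import cv2
from groundingdino.util.inference import Model

gd_model = Model(
    model_config_path='./saves/GroundingDINO_SwinT_OGC.py',
    model_checkpoint_path='./saves/groundingdino_swint_ogc.pth',
)
frame_bgr = cv2.imread('./example/vipseg/images/FRAME.jpg')  # placeholder filename
detections = gd_model.predict_with_classes(
    image=frame_bgr,                     # predict_with_classes expects a BGR image
    classes=['person', 'hat', 'horse'],  # from prompt 'person.hat.horse'
    box_threshold=0.35,                  # DINO_THRESHOLD
    text_threshold=0.35,
)
print(len(detections.xyxy), 'boxes detected')  # 0 here means DEVA has nothing to track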

Masrur02 commented 5 months ago

And what about the log from demo_with_text? Okay, I am re-installing Grounded-DINO/SAM.

Masrur02 commented 5 months ago

I believe Grounding-DINO/SAM is installed properly.

# packages in environment at /home/soicroot/miniconda3/envs/dev:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
absl-py                   2.1.0                    pypi_0    pypi
accelerate                0.30.1                   pypi_0    pypi
addict                    2.4.0                    pypi_0    pypi
aiofiles                  23.2.1                   pypi_0    pypi
aliyun-python-sdk-core    2.15.1                   pypi_0    pypi
aliyun-python-sdk-kms     2.16.3                   pypi_0    pypi
altair                    5.3.0                    pypi_0    pypi
annotated-types           0.6.0                    pypi_0    pypi
anyio                     4.3.0                    pypi_0    pypi
asttokens                 2.4.1                    pypi_0    pypi
attrs                     23.2.0                   pypi_0    pypi
beautifulsoup4            4.12.3                   pypi_0    pypi
blas                      1.0                         mkl  
boto3                     1.34.105                 pypi_0    pypi
botocore                  1.34.105                 pypi_0    pypi
brotli-python             1.0.9            py39h6a678d5_8  
bzip2                     1.0.8                h5eee18b_6  
ca-certificates           2024.3.11            h06a4308_0  
certifi                   2024.2.2         py39h06a4308_0  
cffi                      1.16.0                   pypi_0    pypi
charset-normalizer        2.0.4              pyhd3eb1b0_0  
chumpy                    0.70                     pypi_0    pypi
click                     8.1.7                    pypi_0    pypi
colorama                  0.4.6                    pypi_0    pypi
coloredlogs               15.0.1                   pypi_0    pypi
comm                      0.2.2                    pypi_0    pypi
contourpy                 1.2.1                    pypi_0    pypi
crcmod                    1.7                      pypi_0    pypi
cryptography              42.0.7                   pypi_0    pypi
cudatoolkit               11.3.1               h2bc3f7f_2  
cycler                    0.12.1                   pypi_0    pypi
cython                    3.0.10                   pypi_0    pypi
debugpy                   1.8.1                    pypi_0    pypi
decorator                 5.1.1                    pypi_0    pypi
defusedxml                0.7.1                    pypi_0    pypi
deva                      1.0.0                    pypi_0    pypi
diffusers                 0.27.2                   pypi_0    pypi
dnspython                 2.6.1                    pypi_0    pypi
easydict                  1.13                     pypi_0    pypi
einops                    0.8.0                    pypi_0    pypi
email-validator           2.1.1                    pypi_0    pypi
exceptiongroup            1.2.1                    pypi_0    pypi
executing                 2.0.1                    pypi_0    pypi
fairscale                 0.4.4                    pypi_0    pypi
fastapi                   0.111.0                  pypi_0    pypi
fastapi-cli               0.0.3                    pypi_0    pypi
ffmpeg                    4.3                  hf484d3e_0    pytorch
ffmpy                     0.3.2                    pypi_0    pypi
filelock                  3.14.0                   pypi_0    pypi
filterpy                  1.4.5                    pypi_0    pypi
flatbuffers               24.3.25                  pypi_0    pypi
fonttools                 4.51.0                   pypi_0    pypi
freetype                  2.12.1               h4a9f257_0  
freetype-py               2.4.0                    pypi_0    pypi
fsspec                    2024.5.0                 pypi_0    pypi
gdown                     5.2.0                    pypi_0    pypi
gitdb                     4.0.11                   pypi_0    pypi
gitpython                 3.1.43                   pypi_0    pypi
gmp                       6.2.1                h295c915_3  
gnutls                    3.6.15               he1e5248_0  
gradio                    4.31.2                   pypi_0    pypi
gradio-client             0.16.3                   pypi_0    pypi
groundingdino             0.1.0                     dev_0    <develop>
grpcio                    1.63.0                   pypi_0    pypi
gurobipy                  11.0.2                   pypi_0    pypi
h11                       0.14.0                   pypi_0    pypi
h5py                      3.11.0                   pypi_0    pypi
hickle                    5.0.3                    pypi_0    pypi
httpcore                  1.0.5                    pypi_0    pypi
httptools                 0.6.1                    pypi_0    pypi
httpx                     0.27.0                   pypi_0    pypi
huggingface-hub           0.23.0                   pypi_0    pypi
humanfriendly             10.0                     pypi_0    pypi
idna                      3.7              py39h06a4308_0  
imageio                   2.34.1                   pypi_0    pypi
importlib-metadata        7.1.0                    pypi_0    pypi
importlib-resources       6.4.0                    pypi_0    pypi
intel-openmp              2023.1.0         hdb19cb5_46306  
ipykernel                 6.29.4                   pypi_0    pypi
ipython                   8.18.1                   pypi_0    pypi
jedi                      0.19.1                   pypi_0    pypi
jinja2                    3.1.4                    pypi_0    pypi
jmespath                  0.10.0                   pypi_0    pypi
joblib                    1.4.2                    pypi_0    pypi
jpeg                      9e                   h5eee18b_1  
json-tricks               3.17.3                   pypi_0    pypi
jsonschema                4.22.0                   pypi_0    pypi
jsonschema-specifications 2023.12.1                pypi_0    pypi
jupyter-client            8.6.1                    pypi_0    pypi
jupyter-core              5.7.2                    pypi_0    pypi
kiwisolver                1.4.5                    pypi_0    pypi
lame                      3.100                h7b6447c_0  
lazy-loader               0.4                      pypi_0    pypi
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      3.0                  h295c915_0  
libdeflate                1.17                 h5eee18b_1  
libffi                    3.4.4                h6a678d5_1  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libiconv                  1.16                 h5eee18b_3  
libidn2                   2.3.4                h5eee18b_0  
libpng                    1.6.39               h5eee18b_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtasn1                  4.19.0               h5eee18b_0  
libtiff                   4.5.1                h6a678d5_0  
libunistring              0.9.10               h27cfd23_0  
libwebp-base              1.3.2                h5eee18b_0  
llvmlite                  0.42.0                   pypi_0    pypi
lz4-c                     1.9.4                h6a678d5_1  
markdown                  3.6                      pypi_0    pypi
markdown-it-py            3.0.0                    pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
matplotlib                3.8.4                    pypi_0    pypi
matplotlib-inline         0.1.7                    pypi_0    pypi
mdurl                     0.1.2                    pypi_0    pypi
mkl                       2023.1.0         h213fc3f_46344  
mkl-service               2.4.0            py39h5eee18b_1  
mkl_fft                   1.3.8            py39h5eee18b_0  
mkl_random                1.2.4            py39hdb19cb5_0  
mmcv-full                 1.7.1                    pypi_0    pypi
mmpose                    0.28.0                   pypi_0    pypi
model-index               0.1.11                   pypi_0    pypi
mpmath                    1.3.0                    pypi_0    pypi
munkres                   1.1.4                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
nest-asyncio              1.6.0                    pypi_0    pypi
nettle                    3.7.3                hbbd107a_1  
networkx                  3.2.1                    pypi_0    pypi
numba                     0.59.1                   pypi_0    pypi
numpy                     1.26.4           py39h5f9d8c6_0  
numpy-base                1.26.4           py39hb5e798b_0  
onnx                      1.16.0                   pypi_0    pypi
onnxruntime               1.17.3                   pypi_0    pypi
opencv-python             4.9.0.80                 pypi_0    pypi
opencv-python-headless    4.9.0.80                 pypi_0    pypi
opendatalab               0.0.10                   pypi_0    pypi
openh264                  2.1.1                h4ff587b_0  
openjpeg                  2.4.0                h3ad879b_0  
openmim                   0.3.9                    pypi_0    pypi
openssl                   3.0.13               h7f8727e_1  
openxlab                  0.0.38                   pypi_0    pypi
ordered-set               4.1.0                    pypi_0    pypi
orjson                    3.10.3                   pypi_0    pypi
oss2                      2.17.0                   pypi_0    pypi
packaging                 24.0                     pypi_0    pypi
pandas                    2.2.2                    pypi_0    pypi
parso                     0.8.4                    pypi_0    pypi
pexpect                   4.9.0                    pypi_0    pypi
pillow                    10.3.0           py39h5eee18b_0  
pip                       24.0             py39h06a4308_0  
platformdirs              4.2.2                    pypi_0    pypi
plyfile                   1.0.3                    pypi_0    pypi
prompt-toolkit            3.0.43                   pypi_0    pypi
protobuf                  5.26.1                   pypi_0    pypi
psutil                    5.9.8                    pypi_0    pypi
ptyprocess                0.7.0                    pypi_0    pypi
pulp                      2.8.0                    pypi_0    pypi
pure-eval                 0.2.2                    pypi_0    pypi
pycocoevalcap             1.2                      pypi_0    pypi
pycocotools               2.0.7                    pypi_0    pypi
pycparser                 2.22                     pypi_0    pypi
pycryptodome              3.20.0                   pypi_0    pypi
pydantic                  2.7.1                    pypi_0    pypi
pydantic-core             2.18.2                   pypi_0    pypi
pydub                     0.25.1                   pypi_0    pypi
pyglet                    1.5.27                   pypi_0    pypi
pygments                  2.18.0                   pypi_0    pypi
pyopengl                  3.1.0                    pypi_0    pypi
pyparsing                 3.1.2                    pypi_0    pypi
pyrender                  0.1.45                   pypi_0    pypi
pysocks                   1.7.1            py39h06a4308_0  
python                    3.9.19               h955ad1f_1  
python-dateutil           2.9.0.post0              pypi_0    pypi
python-dotenv             1.0.1                    pypi_0    pypi
python-multipart          0.0.9                    pypi_0    pypi
pytorch                   1.12.0          py3.9_cuda11.3_cudnn8.3.2_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2023.4                   pypi_0    pypi
pyyaml                    6.0.1                    pypi_0    pypi
pyzmq                     26.0.3                   pypi_0    pypi
readline                  8.2                  h5eee18b_0  
referencing               0.35.1                   pypi_0    pypi
regex                     2024.5.15                pypi_0    pypi
requests                  2.28.2                   pypi_0    pypi
rich                      13.4.2                   pypi_0    pypi
rpds-py                   0.18.1                   pypi_0    pypi
ruff                      0.4.4                    pypi_0    pypi
s3transfer                0.10.1                   pypi_0    pypi
sacremoses                0.1.1                    pypi_0    pypi
safetensors               0.4.3                    pypi_0    pypi
scikit-image              0.22.0                   pypi_0    pypi
scikit-learn              1.4.2                    pypi_0    pypi
scipy                     1.13.0                   pypi_0    pypi
segment-anything          1.0                       dev_0    <develop>
semantic-version          2.10.0                   pypi_0    pypi
setuptools                60.2.0                   pypi_0    pypi
shellingham               1.5.4                    pypi_0    pypi
six                       1.16.0                   pypi_0    pypi
smmap                     5.0.1                    pypi_0    pypi
smplx                     0.1.28                   pypi_0    pypi
sniffio                   1.3.1                    pypi_0    pypi
soupsieve                 2.5                      pypi_0    pypi
sqlite                    3.45.3               h5eee18b_0  
stack-data                0.6.3                    pypi_0    pypi
starlette                 0.37.2                   pypi_0    pypi
supervision               0.20.0                   pypi_0    pypi
sympy                     1.12                     pypi_0    pypi
tabulate                  0.9.0                    pypi_0    pypi
tbb                       2021.8.0             hdb19cb5_0  
tensorboard               2.16.2                   pypi_0    pypi
tensorboard-data-server   0.7.2                    pypi_0    pypi
tensorboardx              2.6.2.2                  pypi_0    pypi
thinplate                 1.0.0                    pypi_0    pypi
threadpoolctl             3.5.0                    pypi_0    pypi
tifffile                  2024.5.10                pypi_0    pypi
timm                      0.4.12                   pypi_0    pypi
tk                        8.6.14               h39e8969_0  
tokenizers                0.10.3                   pypi_0    pypi
tomli                     2.0.1                    pypi_0    pypi
tomlkit                   0.12.0                   pypi_0    pypi
toolz                     0.12.1                   pypi_0    pypi
torchgeometry             0.1.2                    pypi_0    pypi
torchvision               0.13.0               py39_cu113    pytorch
tornado                   6.4                      pypi_0    pypi
tqdm                      4.65.2                   pypi_0    pypi
traitlets                 5.14.3                   pypi_0    pypi
transformers              4.15.0                   pypi_0    pypi
trimesh                   4.3.2                    pypi_0    pypi
typer                     0.12.3                   pypi_0    pypi
typing_extensions         4.11.0           py39h06a4308_0  
tzdata                    2024.1                   pypi_0    pypi
ujson                     5.10.0                   pypi_0    pypi
urllib3                   1.26.18                  pypi_0    pypi
uvicorn                   0.29.0                   pypi_0    pypi
uvloop                    0.19.0                   pypi_0    pypi
watchfiles                0.21.0                   pypi_0    pypi
wcwidth                   0.2.13                   pypi_0    pypi
websockets                11.0.3                   pypi_0    pypi
werkzeug                  3.0.3                    pypi_0    pypi
wheel                     0.43.0           py39h06a4308_0  
xtcocotools               1.14.3                   pypi_0    pypi
xz                        5.4.6                h5eee18b_1  
yacs                      0.1.8                    pypi_0    pypi
yapf                      0.40.2                   pypi_0    pypi
zipp                      3.18.1                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_1  
zstd                      1.5.5                hc292b87_2   

These are the packages installed in my conda env. I still get the same error.

Masrur02 commented 5 months ago

I have observed that this error is related to the RTX 4090's sm_89 architecture; on an RTX 2060 I did not get this error.
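
A minimal check of this (nothing DEVA-specific) is to compare the GPU's compute capability against the architectures the installed PyTorch build knows about:

import torch

# The RTX 4090 reports compute capability (8, 9), i.e. sm_89.
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
# Architectures the installed build was compiled for; the torch 1.12 + cu113
# build in the environment above predates sm_89, and its bundled nvrtc rejects
# it outright, hence "invalid value for --gpu-architecture".
print(torch.cuda.get_arch_list())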

hkchengrex commented 5 months ago

Could it be a compilation error? If it was compiled for the 2060, it would not work on a different SM architecture.

hkchengrex commented 5 months ago

Please feel free to re-open if there are any updates.

smileyenot983 commented 2 months ago

Was anyone able to solve this? I am getting the same error on a 4090.

Masrur02 commented 2 months ago

Can you please check if any other process is running on your GPU?

smileyenot983 commented 2 months ago

> Can you please check if any other process is running on your GPU?

Here is the output of nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0 Off |                  Off |
|  0%   43C    P8              27W / 450W |      1MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:03:00.0 Off |                  Off |
|  0%   42C    P8              38W / 450W |      1MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

smileyenot983 commented 2 months ago

Ok, it looks like everything got fixed after some googling. I was able to solve it by updating PyTorch and torchvision to 2.1.2:

pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

Hope this helps someone else too.
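
For anyone landing here, a quick way to verify the upgrade took effect (a small sketch; sm_89 support needs a cu118-or-newer build):

import torch

print(torch.__version__, torch.version.cuda)   # expect 2.1.2+cu118 / 11.8
print(torch.cuda.get_device_capability(0))     # (8, 9) on an RTX 4090
# Re-run the int64 prod/cumsum reduction that failed in the original traceback;
# with the cu118 build, nvrtc recognizes sm_89 and the kernel should now compile.
x = torch.arange(1, 9, device='cuda').reshape(4, 2)
print(x.prod(1).cumsum(0))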